gliner.serve.server module¶
Ray Serve deployment for GLiNER with dynamic batching and memory-aware batch sizing.
- class gliner.serve.server.GLiNERServer(config)[source]¶
Bases: object
GLiNER Ray Serve deployment with dynamic batching.
Supports both entity extraction (NER) and relation extraction. Automatically detects model type and adjusts behavior accordingly.
Uses low-level batch methods (prepare_batch, collate_batch, run_batch, decode_batch) to avoid DataLoader initialization overhead on each call.
- Features:
Dynamic batching with Ray Serve’s @serve.batch
Memory-aware batch size estimation to prevent CUDA OOM
Precompilation for power-of-two batch sizes
Support for both NER and relation extraction models
FlashDeBERTa support for faster inference
Sequence packing for improved throughput
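For intuition, the precompilation feature amounts to warming one shape per power-of-two batch size up to the configured maximum, so later dynamic batches land on an already-compiled shape. A minimal sketch of the size schedule (the helper below is illustrative, not part of the GLiNER API):
>>> def pow2_batch_sizes(max_batch_size):
...     """Illustrative helper: power-of-two sizes up to max_batch_size."""
...     size, sizes = 1, []
...     while size <= max_batch_size:
...         sizes.append(size)
...         size *= 2
...     return sizes
>>> pow2_batch_sizes(8)
[1, 2, 4, 8]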
Initialize the GLiNER server deployment.
- Parameters:
config (GLiNERServeConfig) – Server configuration with model and serving parameters.
- __init__(config)[source]¶
Initialize the GLiNER server deployment.
- Parameters:
config (GLiNERServeConfig) – Server configuration with model and serving parameters.
- batch_size_fn(seq_len=None)[source]¶
Largest precompiled batch size that fits at seq_len.
With no arguments, returns the worst-case answer (max_model_len), suitable for the deployment’s initial max_batch_size. Called again from _infer_batch with the observed seq length (text + label + relation words) to re-size Ray’s batcher for the next accumulation.
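In miniature, the selection rule is: among the precompiled sizes, return the largest whose estimated memory at seq_len still fits. The linear memory model and numbers below are illustrative assumptions, not GLiNER’s actual estimator:
>>> def largest_fitting(seq_len, sizes=(1, 2, 4, 8, 16), budget=4096):
...     """Illustrative only: pretend memory grows as batch_size * seq_len."""
...     fitting = [b for b in sizes if b * seq_len <= budget]
...     return max(fitting) if fitting else min(sizes)
>>> largest_fitting(512)  # 8 * 512 = 4096 fits the budget; 16 * 512 does not
8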
- observed_seq_len(texts, labels=None, relations=None)[source]¶
Total input word count: longest text + all label/relation words.
Labels and relations are concatenated into the input by the model, so they extend the effective sequence length for every sample in the batch.
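Read literally, that is: the word count of the longest text, plus the word counts of every label and relation. A rough re-implementation under that reading (whitespace splitting is an assumption):
>>> def observed_len(texts, labels=None, relations=None):
...     """Sketch: longest text's words plus all label/relation words."""
...     longest = max(len(t.split()) for t in texts)
...     extra = sum(len(w.split()) for w in (labels or []))
...     extra += sum(len(w.split()) for w in (relations or []))
...     return longest + extra
>>> observed_len(["John works at Google"], labels=["person", "org"])
6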
- predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶
Predict entities and optionally relations.
- Parameters:
texts (str | List[str]) – Input text(s) to process.
labels (List[str]) – Entity type labels to extract.
relations (List[str] | None) – Relation type labels (only for relex models).
threshold (float | None) – Confidence threshold for entities.
relation_threshold (float | None) – Confidence threshold for relations.
flat_ner (bool) – Whether to use flat NER (no overlapping entities).
multi_label (bool) – Whether to allow multiple labels per span.
- Returns:
List of result dicts, one per input text. Each dict contains:
"entities": List of entity dicts with start, end, text, label, score
"relations": List of relation dicts (only if model supports relations)
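For orientation, a result for one input text might be shaped as follows (offsets and scores are made up for illustration):
>>> result = [{"entities": [
...     {"start": 0, "end": 4, "text": "John", "label": "person", "score": 0.97},
...     {"start": 14, "end": 20, "text": "Google", "label": "org", "score": 0.93},
... ]}]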
- gliner.serve.server.serve(config, blocking=False)[source]¶
Start GLiNER Ray Serve deployment.
- Parameters:
config (GLiNERServeConfig) – Server configuration.
blocking (bool) – If True, blocks until the server is shut down.
- Returns:
Ray Serve deployment handle for making predictions.
- Return type:
Any
Example
>>> from gliner.serve import GLiNERServeConfig, serve
>>> config = GLiNERServeConfig(model="urchade/gliner_small-v2.1")
>>> handle = serve(config)
>>> # Make predictions
>>> ref = handle.predict.remote("John works at Google", ["person", "org"])
>>> print(ref.result())
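With blocking=True the call only returns once the server is shut down, which suits running the deployment as a standalone process:
>>> serve(config, blocking=True)  # blocks until the server is shut down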
- class gliner.serve.server.GLiNERFactory(model=None, *, config=None, **kwargs)[source]¶
Bases: object
vLLM-style synchronous facade over a GLiNER Ray Serve deployment.
Bundles config → deploy → client into one lifecycle-managed object so callers never see Ray’s ObjectRefs.
Pass a list of texts to predict to preserve dynamic batching: each text is dispatched as a separate request so Ray Serve’s @serve.batch can accumulate them into a single forward pass. A Python loop of single-text calls would serialize and defeat batching.
Example
>>> from gliner.serve import GLiNERFactory
>>> llm = GLiNERFactory(model="urchade/gliner_small-v2.1")
>>> outputs = llm.predict(
...     ["John works at Google", "Paris is in France"],
...     labels=["person", "organization", "location"],
... )
>>> llm.shutdown()
Or as a context manager:
>>> with GLiNERFactory(model="urchade/gliner_small-v2.1") as llm:
...     out = llm.predict("John works at Google", ["person", "org"])
Build a config (if not provided) and start the Ray Serve deployment.
- Parameters:
model (str | None) – Model name or path. Ignored if config is provided.
config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.
**kwargs – Forwarded to GLiNERServeConfig when building one.
- __init__(model=None, *, config=None, **kwargs)[source]¶
Build a config (if not provided) and start the Ray Serve deployment.
- Parameters:
model (str | None) – Model name or path. Ignored if config is provided.
config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.
**kwargs – Forwarded to GLiNERServeConfig when building one.
- property handle¶
Underlying Ray Serve deployment handle — for async/advanced use.
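The handle exposes the same request path as the serve() example above, so advanced callers can fan out requests themselves (a sketch; responses follow Ray Serve’s response API):
>>> responses = [llm.handle.predict.remote(t, ["person"]) for t in texts]
>>> results = [r.result() for r in responses]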
- predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶
Blocking prediction. Returns a dict for str input, a list for list input.
- async predict_async(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶
Async prediction. Concurrent calls accumulate into one batch.
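The batching benefit comes from awaiting the calls concurrently, for example (a sketch assuming an existing llm = GLiNERFactory(...)):
>>> import asyncio
>>> async def extract_all(llm, texts):
...     """Dispatch every text at once so @serve.batch can group them."""
...     return await asyncio.gather(
...         *(llm.predict_async(t, ["person", "org"]) for t in texts)
...     )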
- shutdown()[source]¶
Tear down the Ray Serve deployment and the Ray runtime it booted.
Idempotent. Shutting down Ray after Serve avoids leaving the driver attached to a detached Serve instance, which otherwise produces noisy ServeController ... killed by ray.kill retry warnings in the raylet log when the process exits.