gliner.serve package¶
GLiNER Ray Serve module for production deployment.
Quick Start:
# Start server (CLI)
python -m gliner.serve --model urchade/gliner_small-v2.1

# Make predictions (Python)
from gliner.serve import GLiNERClient
client = GLiNERClient()
result = client.predict("John works at Google", ["person", "organization"])

# Or programmatically start server
from gliner.serve import GLiNERServeConfig, serve
config = GLiNERServeConfig(model="urchade/gliner_small-v2.1")
handle = serve(config)
- Features:
Dynamic batching via Ray Serve
Memory-aware batch sizing (prevents CUDA OOM)
Precompiled power-of-two batch sizes
NER and relation extraction support
FlashDeBERTa and sequence packing
See docs/serving.md for full documentation.
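The deployment is also reachable over plain HTTP for non-Python clients. The sketch below assumes the endpoint accepts JSON POSTs at the route prefix with "texts" and "labels" fields; that schema is an assumption of this example, so check docs/serving.md for the authoritative request format.

# Sketch only: the JSON field names here are assumptions, not a documented schema.
import requests

resp = requests.post(
    "http://localhost:8000/gliner",
    json={"texts": ["John works at Google"], "labels": ["person", "organization"]},
    timeout=30.0,
)
print(resp.json())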
- class gliner.serve.GLiNERClient(base_url='http://localhost:8000', route_prefix='/gliner', timeout=30.0, max_concurrency=32)[source]¶
Bases: object
HTTP client for a running GLiNER Ray Serve deployment.
Example
>>> from gliner.serve import GLiNERClient
>>> client = GLiNERClient()
>>> results = client.predict(
...     "John works at Google in Mountain View",
...     labels=["person", "organization", "location"],
... )
>>> results
{'entities': [{'start': 0, 'end': 4, 'text': 'John', 'label': 'person', ...}, ...]}
Initialize the HTTP client.
- Parameters:
base_url (str) – Scheme + host + port of the Ray Serve HTTP proxy.
route_prefix (str) – Route prefix the deployment is mounted under (must match GLiNERServeConfig.route_prefix).
timeout (float) – Per-request timeout in seconds.
max_concurrency (int) – Maximum in-flight HTTP requests when predicting on a list of texts. Bounds the client-side thread pool.
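When predict is given a list of texts, the client issues the requests concurrently (up to max_concurrency), which lets the server-side dynamic batcher group them into one forward pass. An illustrative call, assuming (as with GLiNERFactory.predict) that list input yields one result per text:

>>> client = GLiNERClient(timeout=60.0, max_concurrency=16)
>>> results = client.predict(
...     ["John works at Google", "Paris is in France"],
...     labels=["person", "organization", "location"],
... )
>>> len(results)
2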
- __init__(base_url='http://localhost:8000', route_prefix='/gliner', timeout=30.0, max_concurrency=32)[source]¶
Initialize the HTTP client.
- Parameters:
base_url (str) – Scheme + host + port of the Ray Serve HTTP proxy.
route_prefix (str) – Route prefix the deployment is mounted under (must match GLiNERServeConfig.route_prefix).
timeout (float) – Per-request timeout in seconds.
max_concurrency (int) – Maximum in-flight HTTP requests when predicting on a list of texts. Bounds the client-side thread pool.
- class gliner.serve.GLiNERFactory(model=None, *, config=None, **kwargs)[source]¶
Bases: object
vLLM-style synchronous facade over a GLiNER Ray Serve deployment.
Bundles config → deploy → client into one lifecycle-managed object so callers never see Ray’s ObjectRefs.
Pass a list of texts to predict to preserve dynamic batching: each text is dispatched as a separate request so Ray Serve’s @serve.batch can accumulate them into a single forward pass. A Python loop of single-text calls would serialize and defeat batching.
Example
>>> from gliner.serve import GLiNERFactory
>>> llm = GLiNERFactory(model="urchade/gliner_small-v2.1")
>>> outputs = llm.predict(
...     ["John works at Google", "Paris is in France"],
...     labels=["person", "organization", "location"],
... )
>>> llm.shutdown()

Or as a context manager:

>>> with GLiNERFactory(model="urchade/gliner_small-v2.1") as llm:
...     out = llm.predict("John works at Google", ["person", "org"])
Build a config (if not provided) and start the Ray Serve deployment.
- Parameters:
model (str | None) – Model name or path. Ignored if config is provided.
config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.
**kwargs – Forwarded to GLiNERServeConfig when building one.
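For full control over serving parameters, build the config first and pass it alone (the mutual-exclusion rule above forbids combining it with model/kwargs):

>>> from gliner.serve import GLiNERFactory, GLiNERServeConfig
>>> config = GLiNERServeConfig(model="urchade/gliner_small-v2.1", max_batch_size=64)
>>> llm = GLiNERFactory(config=config)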
- __init__(model=None, *, config=None, **kwargs)[source]¶
Build a config (if not provided) and start the Ray Serve deployment.
- Parameters:
model (str | None) – Model name or path. Ignored if config is provided.
config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.
**kwargs – Forwarded to GLiNERServeConfig when building one.
- property handle¶
Underlying Ray Serve deployment handle — for async/advanced use.
- predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶
Blocking prediction. Returns a dict for str input, a list for list input.
- async predict_async(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶
Async prediction. Concurrent calls accumulate into one batch.
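Since concurrent calls accumulate into one batch, fan out with asyncio.gather rather than awaiting in a loop. An illustrative pattern, assuming llm is a running GLiNERFactory:

>>> import asyncio
>>> async def run(llm, texts):
...     # Concurrent awaits let Ray Serve batch these into one forward pass.
...     return await asyncio.gather(
...         *(llm.predict_async(t, ["person", "organization"]) for t in texts)
...     )
>>> results = asyncio.run(run(llm, ["John works at Google", "Paris is in France"]))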
- shutdown()[source]¶
Tear down the Ray Serve deployment and the Ray runtime it booted.
Idempotent. Shutting down Ray after Serve avoids leaving the driver attached to a detached Serve instance; the latter produces noisy ServeController ... killed by ray.kill retry warnings in the raylet log when the process exits.
- class gliner.serve.GLiNERMemoryEstimator(safety_factor=1.3, target_memory_fraction=0.9, calibration_probe_batch_size=2)[source]¶
Bases: object
Precomputed memory table for GLiNER inference.
- measure_cuda_context()[source]¶
Record CUDA context overhead. Must be called before the model loads.
- class gliner.serve.GLiNERServeConfig(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)[source]¶
Bases: object
Configuration for GLiNER Ray Serve deployment.
This config controls model loading, serving parameters, and dynamic batching behavior. It is aligned with GLiNEREngineConfig from the engine module.
- model: str¶
- device: str = 'cuda'¶
- dtype: str = 'bfloat16'¶
- quantization: str | None = None¶
- max_model_len: int = 2048¶
- max_span_width: int = 12¶
- max_labels: int = -1¶
- default_threshold: float = 0.5¶
- default_relation_threshold: float = 0.5¶
- num_replicas: int = 1¶
- num_gpus_per_replica: float = 1.0¶
- num_cpus_per_replica: float = 1.0¶
- max_batch_size: int = 32¶
- batch_wait_timeout_ms: float = 5.0¶
- request_timeout_s: float = 30.0¶
- max_ongoing_requests: int = 256¶
- queue_capacity: int = 4096¶
- route_prefix: str = '/gliner'¶
- tokenizer_threads: int = 4¶
- decoding_threads: int = 4¶
- enable_compilation: bool = True¶
- enable_sequence_packing: bool = False¶
- enable_flashdeberta: bool = False¶
- precompiled_batch_sizes: List[int]¶
- target_memory_fraction: float = 0.8¶
- memory_overhead_factor: float = 1.3¶
- calibration_min_seq_len: int = 64¶
- calibration_probe_batch_size: int = 2¶
- warmup_iterations: int = 3¶
- http_port: int = 8000¶
- ray_address: str | None = None¶
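All of the above are plain dataclass fields, so a deployment can be described declaratively. An illustrative throughput-oriented configuration (values are examples, not tuned recommendations):

>>> from gliner.serve import GLiNERServeConfig
>>> config = GLiNERServeConfig(
...     model="urchade/gliner_small-v2.1",
...     num_replicas=2,
...     max_batch_size=64,
...     batch_wait_timeout_ms=10.0,
...     enable_sequence_packing=True,
... )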
- __init__(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)¶
- class gliner.serve.GLiNERServer(config)[source]¶
Bases: object
GLiNER Ray Serve deployment with dynamic batching.
Supports both entity extraction (NER) and relation extraction. Automatically detects model type and adjusts behavior accordingly.
Uses low-level batch methods (prepare_batch, collate_batch, run_batch, decode_batch) to avoid DataLoader initialization overhead on each call.
- Features:
Dynamic batching with Ray Serve’s @serve.batch
Memory-aware batch size estimation to prevent CUDA OOM
Precompilation for power-of-two batch sizes
Support for both NER and relation extraction models
FlashDeBERTa support for faster inference
Sequence packing for improved throughput
Initialize the GLiNER server deployment.
- Parameters:
config (GLiNERServeConfig) – Server configuration with model and serving parameters.
- __init__(config)[source]¶
Initialize the GLiNER server deployment.
- Parameters:
config (GLiNERServeConfig) – Server configuration with model and serving parameters.
- batch_size_fn(seq_len=None)[source]¶
Largest precompiled batch size that fits at seq_len.
With no arguments, returns the worst-case answer (max_model_len), suitable for the deployment’s initial max_batch_size. Called again from _infer_batch with the observed seq length (text + label + relation words) to re-size Ray’s batcher for the next accumulation.
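A minimal sketch of that lookup, assuming a sorted list of precompiled power-of-two sizes and a fits() predicate backed by the memory estimator’s table (both names are illustrative, not actual attributes):

def batch_size_fn(self, seq_len=None):
    # No observation yet: assume the worst-case sequence length.
    if seq_len is None:
        seq_len = self.config.max_model_len
    # Pick the largest precompiled size whose estimated peak memory fits.
    fitting = [bs for bs in self.precompiled_sizes if self.fits(bs, seq_len)]
    return max(fitting) if fitting else min(self.precompiled_sizes)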
- observed_seq_len(texts, labels=None, relations=None)[source]¶
Total input word count: longest text + all label/relation words.
Labels and relations are concatenated into the input by the model, so they extend the effective sequence length for every sample in the batch.
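A standalone sketch of that computation, mirroring the description above (whitespace word-splitting is an assumption about how words are counted):

def observed_seq_len(texts, labels=None, relations=None):
    # Longest input text, in words.
    longest = max(len(t.split()) for t in texts)
    # Labels and relations are concatenated into every sample's input,
    # so their word counts extend the effective sequence length.
    label_words = sum(len(lab.split()) for lab in (labels or []))
    relation_words = sum(len(rel.split()) for rel in (relations or []))
    return longest + label_words + relation_words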
- predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶
Predict entities and optionally relations.
- Parameters:
texts (str | List[str]) – Input text(s) to process.
labels (List[str]) – Entity type labels to extract.
relations (List[str] | None) – Relation type labels (only for relex models).
threshold (float | None) – Confidence threshold for entities.
relation_threshold (float | None) – Confidence threshold for relations.
flat_ner (bool) – Whether to use flat NER (no overlapping entities).
multi_label (bool) – Whether to allow multiple labels per span.
- Returns:
List of result dicts, one per input text. Each dict contains:
"entities": List of entity dicts with start, end, text, label, score
"relations": List of relation dicts (only if the model supports relations)
- Return type:
List[dict]
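An illustrative read of that structure, whichever way the call is dispatched (directly or through a deployment handle):

>>> results = server.predict(["John works at Google"], ["person", "organization"])
>>> for ent in results[0]["entities"]:
...     print(ent["text"], ent["label"], round(ent["score"], 2))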
- gliner.serve.get_client(base_url='http://localhost:8000', route_prefix='/gliner', timeout=30.0, max_concurrency=32)[source]¶
Convenience constructor for GLiNERClient.
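Equivalent to constructing GLiNERClient directly with the same arguments:

>>> from gliner.serve import get_client
>>> client = get_client(timeout=60.0)
>>> client.predict("John works at Google", ["person", "organization"])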
- gliner.serve.serve(config, blocking=False)[source]¶
Start GLiNER Ray Serve deployment.
- Parameters:
config (GLiNERServeConfig) – Server configuration.
blocking (bool) – If True, blocks until the server is shut down.
- Returns:
Ray Serve deployment handle for making predictions.
- Return type:
Any
Example
>>> from gliner.serve import GLiNERServeConfig, serve
>>> config = GLiNERServeConfig(model="urchade/gliner_small-v2.1")
>>> handle = serve(config)
>>> # Make predictions
>>> ref = handle.predict.remote("John works at Google", ["person", "org"])
>>> print(ref.result())
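With blocking=True, the call instead keeps the process alive until the server shuts down, which suits a dedicated entrypoint such as a container command:

>>> serve(GLiNERServeConfig(model="urchade/gliner_small-v2.1"), blocking=True)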