gliner.serve package¶

GLiNER Ray Serve module for production deployment.

Quick Start:

# Start server (CLI)
python -m gliner.serve --model urchade/gliner_small-v2.1

# Make predictions (Python)
from gliner.serve import GLiNERClient
client = GLiNERClient()
result = client.predict("John works at Google", ["person", "organization"])

# Or programmatically start server
from gliner.serve import GLiNERServeConfig, serve
config = GLiNERServeConfig(model="urchade/gliner_small-v2.1")
handle = serve(config)

Features:
  • Dynamic batching via Ray Serve

  • Memory-aware batch sizing (prevents CUDA OOM)

  • Precompiled power-of-two batch sizes

  • NER and relation extraction support

  • FlashDeBERTa and sequence packing

See docs/serving.md for full documentation.

class gliner.serve.GLiNERClient(base_url='http://localhost:8000', route_prefix='/gliner', timeout=30.0, max_concurrency=32)[source]¶

Bases: object

HTTP client for a running GLiNER Ray Serve deployment.

Example

>>> from gliner.serve import GLiNERClient
>>> client = GLiNERClient()
>>> results = client.predict(
...     "John works at Google in Mountain View",
...     labels=["person", "organization", "location"],
... )
>>> results
{'entities': [{'start': 0, 'end': 4, 'text': 'John', 'label': 'person', ...}, ...]}

Initialize the HTTP client.

Parameters:
  • base_url (str) – Scheme + host + port of the Ray Serve HTTP proxy.

  • route_prefix (str) – Route prefix the deployment is mounted under (must match GLiNERServeConfig.route_prefix).

  • timeout (float) – Per-request timeout in seconds.

  • max_concurrency (int) – Maximum in-flight HTTP requests when predicting on a list of texts. Bounds the client-side thread pool.

__init__(base_url='http://localhost:8000', route_prefix='/gliner', timeout=30.0, max_concurrency=32)[source]¶

Initialize the HTTP client.

Parameters:
  • base_url (str) – Scheme + host + port of the Ray Serve HTTP proxy.

  • route_prefix (str) – Route prefix the deployment is mounted under (must match GLiNERServeConfig.route_prefix).

  • timeout (float) – Per-request timeout in seconds.

  • max_concurrency (int) – Maximum in-flight HTTP requests when predicting on a list of texts. Bounds the client-side thread pool.

predict(text, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Blocking prediction. str in -> dict out; list in -> list out.
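
For example (a sketch; assumes a deployment is already running):

>>> client.predict("John works at Google", ["person"])
{'entities': [...]}
>>> client.predict(["John works at Google", "Paris is in France"], ["person", "location"])
[{'entities': [...]}, {'entities': [...]}]

List inputs are dispatched as separate HTTP requests (bounded by max_concurrency) so the server can accumulate them into a single batch.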

async predict_async(text, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Async version of predict.

class gliner.serve.GLiNERFactory(model=None, *, config=None, **kwargs)[source]¶

Bases: object

vLLM-style synchronous facade over a GLiNER Ray Serve deployment.

Bundles config → deploy → client into one lifecycle-managed object so callers never see Ray’s ObjectRefs.

Pass a list of texts to predict to preserve dynamic batching: each text is dispatched as a separate request so Ray Serve’s @serve.batch can accumulate them into a single forward pass. A Python loop of single-text calls would serialize and defeat batching.

Example

>>> from gliner.serve import GLiNERFactory
>>> llm = GLiNERFactory(model="urchade/gliner_small-v2.1")
>>> outputs = llm.predict(
...     ["John works at Google", "Paris is in France"],
...     labels=["person", "organization", "location"],
... )
>>> llm.shutdown()

Or as a context manager:

>>> with GLiNERFactory(model="urchade/gliner_small-v2.1") as llm:
...     out = llm.predict("John works at Google", ["person", "org"])

Build a config (if not provided) and start the Ray Serve deployment.

Parameters:
  • model (str | None) – Model name or path. Ignored if config is provided.

  • config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.

  • **kwargs – Forwarded to GLiNERServeConfig when building one.

__init__(model=None, *, config=None, **kwargs)[source]¶

Build a config (if not provided) and start the Ray Serve deployment.

Parameters:
  • model (str | None) – Model name or path. Ignored if config is provided.

  • config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.

  • **kwargs – Forwarded to GLiNERServeConfig when building one.

property handle¶

Underlying Ray Serve deployment handle — for async/advanced use.

predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Blocking prediction. Returns a dict for str input, list for list input.

async predict_async(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Async prediction. Concurrent calls accumulate into one batch.
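
For instance, several async calls issued concurrently land in the same batch (a sketch; llm is a running GLiNERFactory and the labels are illustrative):

import asyncio

async def run(llm, texts):
    # Concurrent predict_async calls are accumulated by Ray Serve's
    # dynamic batcher into a single forward pass on the replica.
    return await asyncio.gather(
        *(llm.predict_async(t, labels=["person", "organization"]) for t in texts)
    )

results = asyncio.run(run(llm, ["John works at Google", "Paris is in France"]))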

shutdown()[source]¶

Tear down the Ray Serve deployment and the Ray runtime it booted.

Idempotent. Shutting down Ray after Serve avoids leaving the driver attached to a detached Serve instance; a lingering attachment produces noisy ServeController ... killed by ray.kill retry warnings in the raylet log when the process exits.

class gliner.serve.GLiNERMemoryEstimator(safety_factor=1.3, target_memory_fraction=0.9, calibration_probe_batch_size=2)[source]¶

Bases: object

Precomputed memory table for GLiNER inference.

__init__(safety_factor=1.3, target_memory_fraction=0.9, calibration_probe_batch_size=2)[source]¶
measure_cuda_context()[source]¶

Record CUDA context overhead. Must be called before the model loads.

measure_model_memory()[source]¶

Record model weight memory. Must be called after the model loads.

available_memory()[source]¶

Budget for a batch: total_gpu - cuda_context - model_weights.
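
The measurement order matters. A minimal lifecycle sketch (load_gliner_model is a hypothetical stand-in for the caller's model-loading code):

estimator = GLiNERMemoryEstimator(safety_factor=1.3, target_memory_fraction=0.9)
estimator.measure_cuda_context()       # before the model loads
model = load_gliner_model()            # hypothetical loader, not part of this module
estimator.measure_model_memory()       # after the model loads
budget = estimator.available_memory()  # total_gpu - cuda_context - model_weights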

calibrate(batch_method, max_seq_len, min_seq_len=64)[source]¶

Populate per_sample_table across power-of-two seq lengths.

Uses a single dummy label so the probed sequence length is dominated by text tokens; label/relation words are accounted for at lookup time by the caller extending seq_len.

per_sample_at(seq_len)[source]¶

Pessimistic per-sample memory at or above seq_len.

batch_size_fn(seq_len, precompiled_sizes)[source]¶

Largest precompiled batch size satisfying per_sample * N <= budget.

Budget = total_gpu - cuda_context - model_weights (times the configured target_memory_fraction). The caller is responsible for folding label / relation word counts into seq_len.
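
As a worked illustration of that arithmetic (all numbers hypothetical):

total_gpu = 24 * 1024**3              # e.g. a 24 GB device
cuda_context = 600 * 1024**2          # from measure_cuda_context()
model_weights = 1200 * 1024**2        # from measure_model_memory()
budget = (total_gpu - cuda_context - model_weights) * 0.9  # target_memory_fraction

per_sample = estimator.per_sample_at(512)  # pessimistic bytes per sample at seq_len 512
sizes = [1, 2, 4, 8, 16, 32]
# Largest precompiled size N with per_sample * N <= budget:
batch_size = max(n for n in sizes if per_sample * n <= budget)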

class gliner.serve.GLiNERServeConfig(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)[source]¶

Bases: object

Configuration for GLiNER Ray Serve deployment.

This config controls model loading, serving parameters, and dynamic batching behavior. It is aligned with GLiNEREngineConfig from the engine module.
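
For example, overriding a few batching and replica fields (values illustrative):

>>> from gliner.serve import GLiNERServeConfig
>>> config = GLiNERServeConfig(
...     model="urchade/gliner_small-v2.1",
...     num_replicas=2,
...     max_batch_size=64,
...     batch_wait_timeout_ms=10.0,
...     enable_flashdeberta=True,
... )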

model: str¶
device: str = 'cuda'¶
dtype: str = 'bfloat16'¶
quantization: str | None = None¶
max_model_len: int = 2048¶
max_span_width: int = 12¶
max_labels: int = -1¶
default_threshold: float = 0.5¶
default_relation_threshold: float = 0.5¶
num_replicas: int = 1¶
num_gpus_per_replica: float = 1.0¶
num_cpus_per_replica: float = 1.0¶
max_batch_size: int = 32¶
batch_wait_timeout_ms: float = 5.0¶
request_timeout_s: float = 30.0¶
max_ongoing_requests: int = 256¶
queue_capacity: int = 4096¶
route_prefix: str = '/gliner'¶
tokenizer_threads: int = 4¶
decoding_threads: int = 4¶
enable_compilation: bool = True¶
enable_sequence_packing: bool = False¶
enable_flashdeberta: bool = False¶
precompiled_batch_sizes: List[int]¶
target_memory_fraction: float = 0.8¶
memory_overhead_factor: float = 1.3¶
calibration_min_seq_len: int = 64¶
calibration_probe_batch_size: int = 2¶
warmup_iterations: int = 3¶
http_port: int = 8000¶
ray_address: str | None = None¶
to_env_vars()[source]¶

Convert config to environment variables for model loading.

__init__(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)¶
class gliner.serve.GLiNERServer(config)[source]¶

Bases: object

GLiNER Ray Serve deployment with dynamic batching.

Supports both entity extraction (NER) and relation extraction. Automatically detects model type and adjusts behavior accordingly.

Uses low-level batch methods (prepare_batch, collate_batch, run_batch, decode_batch) to avoid DataLoader initialization overhead on each call.

Features:
  • Dynamic batching with Ray Serve’s @serve.batch

  • Memory-aware batch size estimation to prevent CUDA OOM

  • Precompilation for power-of-two batch sizes

  • Support for both NER and relation extraction models

  • FlashDeBERTa support for faster inference

  • Sequence packing for improved throughput

Initialize the GLiNER server deployment.

Parameters:

config (GLiNERServeConfig) – Server configuration with model and serving parameters.

__init__(config)[source]¶

Initialize the GLiNER server deployment.

Parameters:

config (GLiNERServeConfig) – Server configuration with model and serving parameters.

batch_size_fn(seq_len=None)[source]¶

Largest precompiled batch size that fits at seq_len.

With no arguments, returns the worst-case answer (max_model_len), suitable for the deployment’s initial max_batch_size. Called again from _infer_batch with the observed seq length (text + label + relation words) to re-size Ray’s batcher for the next accumulation.

observed_seq_len(texts, labels=None, relations=None)[source]¶

Total input word count: longest text + all label/relation words.

Labels and relations are concatenated into the input by the model, so they extend the effective sequence length for every sample in the batch.
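
Illustratively (a sketch of the word-count arithmetic; the exact accounting is internal):

texts = ["John works at Google in Mountain View"]  # longest text: 7 words
labels = ["person", "organization", "location"]    # +3 label words
# observed_seq_len(texts, labels) is then roughly 7 + 3 = 10 words,
# and every sample in the batch pays for the label words.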

predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Predict entities and optionally relations.

Parameters:
  • texts (str | List[str]) – Input text(s) to process.

  • labels (List[str]) – Entity type labels to extract.

  • relations (List[str] | None) – Relation type labels (only for relex models).

  • threshold (float | None) – Confidence threshold for entities.

  • relation_threshold (float | None) – Confidence threshold for relations.

  • flat_ner (bool) – Whether to use flat NER (no overlapping entities).

  • multi_label (bool) – Whether to allow multiple labels per span.

Returns:

List of result dicts, one per input text. Each dict contains:
  • "entities": List of entity dicts with start, end, text, label, score

  • "relations": List of relation dicts (only if the model supports relations)

gliner.serve.get_client(base_url='http://localhost:8000', route_prefix='/gliner', timeout=30.0, max_concurrency=32)[source]¶

Convenience constructor for GLiNERClient.

gliner.serve.serve(config, blocking=False)[source]¶

Start GLiNER Ray Serve deployment.

Parameters:
  • config (GLiNERServeConfig) – Server configuration.

  • blocking (bool) – If True, blocks until the server is shut down.

Returns:

Ray Serve deployment handle for making predictions.

Return type:

Any

Example

>>> from gliner.serve import GLiNERServeConfig, serve
>>> config = GLiNERServeConfig(model="urchade/gliner_small-v2.1")
>>> handle = serve(config)
>>> # Make predictions
>>> ref = handle.predict.remote("John works at Google", ["person", "org"])
>>> print(ref.result())
gliner.serve.shutdown()[source]¶

Shutdown the GLiNER Ray Serve deployment.
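
A minimal start/stop pairing with serve() (mirrors the example above):

>>> from gliner.serve import GLiNERServeConfig, serve, shutdown
>>> handle = serve(GLiNERServeConfig(model="urchade/gliner_small-v2.1"))
>>> handle.predict.remote("John works at Google", ["person"]).result()
>>> shutdown()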

Submodules¶