gliner.serve.server module¶

Ray Serve deployment for GLiNER with dynamic batching and memory-aware batch sizing.

class gliner.serve.server.GLiNERServer(config)[source]¶

Bases: object

GLiNER Ray Serve deployment with dynamic batching.

Supports both entity extraction (NER) and relation extraction. Automatically detects model type and adjusts behavior accordingly.

Uses low-level batch methods (prepare_batch, collate_batch, run_batch, decode_batch) to avoid DataLoader initialization overhead on each call.

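A hypothetical sketch of that low-level path; the exact signatures of prepare_batch, collate_batch, run_batch, and decode_batch are assumptions inferred from their names, not the real GLiNER API:

>>> def infer_without_dataloader(model, texts, labels):
...     raw = model.prepare_batch(texts, labels)    # tokenize + assemble model inputs
...     batch = model.collate_batch(raw)            # pad/stack into tensors once
...     outputs = model.run_batch(batch)            # a single forward pass
...     return model.decode_batch(outputs, texts)   # raw scores -> entity dicts
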
Features:
  • Dynamic batching with Ray Serve’s @serve.batch (see the sketch after this list)

  • Memory-aware batch size estimation to prevent CUDA OOM

  • Precompilation for power-of-two batch sizes

  • Support for both NER and relation extraction models

  • FlashDeBERTa support for faster inference

  • Sequence packing for improved throughput

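The dynamic-batching feature relies on Ray Serve’s @serve.batch decorator. A minimal sketch of the pattern (the method name follows the docs below; the class and body are illustrative, not the actual deployment):

>>> from ray import serve
>>> @serve.deployment
... class EchoBatcher:
...     @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.01)
...     async def _infer_batch(self, texts):
...         # Ray Serve coalesces concurrent single-text calls into `texts`
...         # and expects one result per element, in order.
...         return [t.upper() for t in texts]
...     async def __call__(self, text):
...         return await self._infer_batch(text)
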
Initialize the GLiNER server deployment.

Parameters:

config (GLiNERServeConfig) – Server configuration with model and serving parameters.

__init__(config)[source]¶

Initialize the GLiNER server deployment.

Parameters:

config (GLiNERServeConfig) – Server configuration with model and serving parameters.

batch_size_fn(seq_len=None)[source]¶

Largest precompiled batch size that fits at seq_len.

With no arguments, returns the worst-case answer (max_model_len), suitable for the deployment’s initial max_batch_size. Called again from _infer_batch with the observed sequence length (text + label + relation words) to resize Ray’s batcher for the next accumulation.

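A minimal sketch of this selection logic, assuming a hypothetical fixed token budget and power-of-two table (the real ceiling comes from memory-aware estimation and the precompilation step):

>>> PRECOMPILED = [1, 2, 4, 8, 16, 32, 64]
>>> def batch_size_fn(seq_len=None, max_model_len=512, budget=64 * 512):
...     if seq_len is None:
...         seq_len = max_model_len              # worst case for initial max_batch_size
...     fits = budget // max(seq_len, 1)         # memory-derived ceiling
...     return max((b for b in PRECOMPILED if b <= fits), default=1)
>>> batch_size_fn()                              # worst case at max_model_len
64
>>> batch_size_fn(seq_len=2048)                  # longer inputs, smaller batches
16
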
observed_seq_len(texts, labels=None, relations=None)[source]¶

Total input word count: longest text + all label/relation words.

Labels and relations are concatenated into the input by the model, so they extend the effective sequence length for every sample in the batch.

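A sketch of that computation, assuming whitespace word counts (a simplification of whatever counting the source actually uses):

>>> def observed_seq_len(texts, labels=None, relations=None):
...     longest = max(len(t.split()) for t in texts)
...     extra = sum(len(l.split()) for l in (labels or []))
...     extra += sum(len(r.split()) for r in (relations or []))
...     return longest + extra
>>> observed_seq_len(["John works at Google"], labels=["person", "org"])
6
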
predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Predict entities and optionally relations.

Parameters:
  • texts (str | List[str]) – Input text(s) to process.

  • labels (List[str]) – Entity type labels to extract.

  • relations (List[str] | None) – Relation type labels (only for relex models).

  • threshold (float | None) – Confidence threshold for entities.

  • relation_threshold (float | None) – Confidence threshold for relations.

  • flat_ner (bool) – Whether to use flat NER (no overlapping entities).

  • multi_label (bool) – Whether to allow multiple labels per span.

Returns:

List of result dicts, one per input text. Each dict contains:
  • “entities”: List of entity dicts with start, end, text, label, score

  • “relations”: List of relation dicts (only if model supports relations)

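An illustrative call showing the result shape (offsets and scores are made up, and `server` stands for a deployed instance):

>>> out = server.predict(["John works at Google"], labels=["person", "org"])
>>> out[0]["entities"]
[{'start': 0, 'end': 4, 'text': 'John', 'label': 'person', 'score': 0.97},
 {'start': 14, 'end': 20, 'text': 'Google', 'label': 'org', 'score': 0.95}]
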
gliner.serve.server.serve(config, blocking=False)[source]¶

Start GLiNER Ray Serve deployment.

Parameters:
  • config (GLiNERServeConfig) – Server configuration.

  • blocking (bool) – If True, blocks until the server is shut down.

Returns:

Ray Serve deployment handle for making predictions.

Return type:

Any

Example

>>> from gliner.serve import GLiNERServeConfig, serve
>>> config = GLiNERServeConfig(model="urchade/gliner_small-v2.1")
>>> handle = serve(config)
>>> # Make predictions
>>> ref = handle.predict.remote("John works at Google", ["person", "org"])
>>> print(ref.result())

gliner.serve.server.shutdown()[source]¶

Shutdown the GLiNER Ray Serve deployment.

class gliner.serve.server.GLiNERFactory(model=None, *, config=None, **kwargs)[source]¶

Bases: object

vLLM-style synchronous facade over a GLiNER Ray Serve deployment.

Bundles config → deploy → client into one lifecycle-managed object so callers never see Ray’s ObjectRefs.

Pass a list of texts to predict to preserve dynamic batching: each text is dispatched as a separate request so Ray Serve’s @serve.batch can accumulate them into a single forward pass. A Python loop of single-text calls would serialize and defeat batching.

Example

>>> from gliner.serve import GLiNERFactory
>>> llm = GLiNERFactory(model="urchade/gliner_small-v2.1")
>>> outputs = llm.predict(
...     ["John works at Google", "Paris is in France"],
...     labels=["person", "organization", "location"],
... )
>>> llm.shutdown()

Or as a context manager:

>>> with GLiNERFactory(model="urchade/gliner_small-v2.1") as llm:
...     out = llm.predict("John works at Google", ["person", "org"])

Build a config (if not provided) and start the Ray Serve deployment.

Parameters:
  • model (str | None) – Model name or path. Ignored if config is provided.

  • config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.

  • **kwargs – Forwarded to GLiNERServeConfig when building one.

__init__(model=None, *, config=None, **kwargs)[source]¶

Build a config (if not provided) and start the Ray Serve deployment.

Parameters:
  • model (str | None) – Model name or path. Ignored if config is provided.

  • config (GLiNERServeConfig | None) – Prebuilt GLiNERServeConfig. Mutually exclusive with model/kwargs.

  • **kwargs – Forwarded to GLiNERServeConfig when building one.

property handle¶

Underlying Ray Serve deployment handle — for async/advanced use.

predict(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Blocking prediction. Returns a single result dict for str input and a list of result dicts for list input.

async predict_async(texts, labels, relations=None, threshold=None, relation_threshold=None, flat_ner=True, multi_label=False)[source]¶

Async prediction. Concurrent calls accumulate into one batch.

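A sketch of that fan-out pattern (the helper name is illustrative):

>>> import asyncio
>>> async def fan_out(llm, texts, labels):
...     # Concurrent awaits give Ray Serve's batcher a chance to coalesce
...     # these single-text requests into one forward pass.
...     return await asyncio.gather(*(llm.predict_async(t, labels) for t in texts))
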
shutdown()[source]¶

Tear down the Ray Serve deployment and the Ray runtime it booted.

Idempotent. Shutting down Ray after Serve avoids leaving the driver attached to a detached Serve instance, which otherwise produces noisy ServeController ... killed by ray.kill retry warnings in the raylet log when the process exits.
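In terms of the underlying Ray APIs, the ordering above amounts to the following (a sketch; GLiNERFactory.shutdown presumably wraps the equivalent calls):

>>> from ray import serve
>>> import ray
>>> serve.shutdown()   # tear down the Serve controller and deployments first
>>> ray.shutdown()     # then stop the Ray runtime this process started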