gliner.serve.config module¶

Configuration for GLiNER Ray Serve deployment.

class gliner.serve.config.GLiNERServeConfig(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)[source]¶

Bases: object

Configuration for GLiNER Ray Serve deployment.

This config controls model loading, serving parameters, and dynamic batching behavior. Its fields are aligned with GLiNEREngineConfig from the engine module.

model: str¶
device: str = 'cuda'¶
dtype: str = 'bfloat16'¶
quantization: str | None = None¶
max_model_len: int = 2048¶
max_span_width: int = 12¶
max_labels: int = -1¶
default_threshold: float = 0.5¶
default_relation_threshold: float = 0.5¶
num_replicas: int = 1¶
num_gpus_per_replica: float = 1.0¶
num_cpus_per_replica: float = 1.0¶
max_batch_size: int = 32¶
batch_wait_timeout_ms: float = 5.0¶
request_timeout_s: float = 30.0¶
max_ongoing_requests: int = 256¶
queue_capacity: int = 4096¶
route_prefix: str = '/gliner'¶
tokenizer_threads: int = 4¶
decoding_threads: int = 4¶
enable_compilation: bool = True¶
enable_sequence_packing: bool = False¶
enable_flashdeberta: bool = False¶
precompiled_batch_sizes: List[int]¶
target_memory_fraction: float = 0.8¶
memory_overhead_factor: float = 1.3¶
calibration_min_seq_len: int = 64¶
calibration_probe_batch_size: int = 2¶
warmup_iterations: int = 3¶
http_port: int = 8000¶
ray_address: str | None = None¶
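The `<factory>` default shown for precompiled_batch_sizes in the signature is how Sphinx renders a dataclasses default_factory: mutable defaults such as lists cannot be shared class-level defaults, so each instance gets a fresh copy. A minimal sketch of how such a field is declared (the field names mirror the class above, but the concrete default batch sizes and the stand-in class are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServeConfigSketch:
    """Illustrative subset of GLiNERServeConfig; not the real class."""
    model: str
    device: str = "cuda"
    max_batch_size: int = 32
    # A mutable default needs default_factory -- rendered as <factory>
    # in the generated signature. These sizes are an assumed example.
    precompiled_batch_sizes: List[int] = field(
        default_factory=lambda: [1, 2, 4, 8, 16, 32]
    )
    ray_address: Optional[str] = None

cfg = ServeConfigSketch(model="urchade/gliner_multi-v2.1")
```

Because the list is built per instance, mutating one config's precompiled_batch_sizes does not leak into other configs.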
to_env_vars()[source]¶

Convert config to environment variables for model loading.

__init__(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)¶