gliner.serve.config module¶

Configuration for GLiNER Ray Serve deployment.

class gliner.serve.config.GLiNERServeConfig(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)[source]¶

Bases: object

Configuration for GLiNER Ray Serve deployment.

This config controls model loading, serving parameters, and dynamic batching behavior. Its fields are aligned with GLiNEREngineConfig from the engine module.

model: str¶
device: str = 'cuda'¶
dtype: str = 'bfloat16'¶
quantization: str | None = None¶
max_model_len: int = 2048¶
max_span_width: int = 12¶
max_labels: int = -1¶
default_threshold: float = 0.5¶
default_relation_threshold: float = 0.5¶
num_replicas: int = 1¶
num_gpus_per_replica: float = 1.0¶
num_cpus_per_replica: float = 1.0¶
max_batch_size: int = 32¶
batch_wait_timeout_ms: float = 5.0¶
request_timeout_s: float = 30.0¶
max_ongoing_requests: int = 256¶
queue_capacity: int = 4096¶
route_prefix: str = '/gliner'¶
tokenizer_threads: int = 4¶
decoding_threads: int = 4¶
enable_compilation: bool = True¶
enable_sequence_packing: bool = False¶
enable_flashdeberta: bool = False¶
precompiled_batch_sizes: List[int]¶
target_memory_fraction: float = 0.8¶
memory_overhead_factor: float = 1.3¶
calibration_min_seq_len: int = 64¶
calibration_probe_batch_size: int = 2¶
warmup_iterations: int = 3¶
http_port: int = 8000¶
ray_address: str | None = None¶
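The `<factory>` default shown for precompiled_batch_sizes in the signature is how Sphinx renders a dataclasses default_factory: mutable defaults such as lists cannot be shared class-level defaults, so each instance gets a fresh copy. A minimal sketch of how such a field is declared (the field names mirror the class above, but the concrete default batch sizes and the stand-in class are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ServeConfigSketch:
    """Illustrative subset of GLiNERServeConfig; not the real class."""
    model: str
    device: str = "cuda"
    max_batch_size: int = 32
    # A mutable default needs default_factory -- rendered as <factory>
    # in the generated signature. These sizes are an assumed example.
    precompiled_batch_sizes: List[int] = field(
        default_factory=lambda: [1, 2, 4, 8, 16, 32]
    )
    ray_address: Optional[str] = None

cfg = ServeConfigSketch(model="urchade/gliner_multi-v2.1")
```

Because the list is built per instance, mutating one config's precompiled_batch_sizes does not leak into other configs.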
to_env_vars()[source]¶

Convert config to environment variables for model loading.

__init__(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)¶