gliner.serve.config module
Configuration for GLiNER Ray Serve deployment.
- class gliner.serve.config.GLiNERServeConfig(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)
Bases: object
Configuration for GLiNER Ray Serve deployment.
This config controls model loading, serving parameters, and dynamic batching behavior. It is aligned with GLiNEREngineConfig from the engine module.
- model: str
- device: str = 'cuda'
- dtype: str = 'bfloat16'
- quantization: str | None = None
- max_model_len: int = 2048
- max_span_width: int = 12
- max_labels: int = -1
- default_threshold: float = 0.5
- default_relation_threshold: float = 0.5
- num_replicas: int = 1
- num_gpus_per_replica: float = 1.0
- num_cpus_per_replica: float = 1.0
- max_batch_size: int = 32
- batch_wait_timeout_ms: float = 5.0
- request_timeout_s: float = 30.0
- max_ongoing_requests: int = 256
- queue_capacity: int = 4096
- route_prefix: str = '/gliner'
- tokenizer_threads: int = 4
- decoding_threads: int = 4
- enable_compilation: bool = True
- enable_sequence_packing: bool = False
- enable_flashdeberta: bool = False
- precompiled_batch_sizes: List[int]
- target_memory_fraction: float = 0.8
- memory_overhead_factor: float = 1.3
- calibration_min_seq_len: int = 64
- calibration_probe_batch_size: int = 2
- warmup_iterations: int = 3
- http_port: int = 8000
- ray_address: str | None = None
- __init__(model, device='cuda', dtype='bfloat16', quantization=None, max_model_len=2048, max_span_width=12, max_labels=-1, default_threshold=0.5, default_relation_threshold=0.5, num_replicas=1, num_gpus_per_replica=1.0, num_cpus_per_replica=1.0, max_batch_size=32, batch_wait_timeout_ms=5.0, request_timeout_s=30.0, max_ongoing_requests=256, queue_capacity=4096, route_prefix='/gliner', tokenizer_threads=4, decoding_threads=4, enable_compilation=True, enable_sequence_packing=False, enable_flashdeberta=False, precompiled_batch_sizes=<factory>, target_memory_fraction=0.8, memory_overhead_factor=1.3, calibration_min_seq_len=64, calibration_probe_batch_size=2, warmup_iterations=3, http_port=8000, ray_address=None)
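The signature above reads like a standard Python dataclass: `model` is the only required argument, and the `<factory>` default on `precompiled_batch_sizes` is how a `default_factory` field is rendered. The sketch below illustrates that pattern with a hypothetical stand-in class, `ServeConfigSketch`, that copies a handful of the documented fields and defaults so it runs without GLiNER or Ray installed. The model identifier, the empty-list factory default, and the `batch_wait_timeout_s` helper are assumptions made for illustration; they are not part of the documented API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServeConfigSketch:
    """Hypothetical stand-in mirroring a subset of the documented fields."""

    model: str  # required: the documented signature gives it no default
    device: str = "cuda"
    num_replicas: int = 1
    max_batch_size: int = 32
    batch_wait_timeout_ms: float = 5.0
    route_prefix: str = "/gliner"
    http_port: int = 8000
    ray_address: Optional[str] = None
    # The real factory default is not shown in the signature (<factory>);
    # an empty list is used here purely as a placeholder.
    precompiled_batch_sizes: List[int] = field(default_factory=list)


def batch_wait_timeout_s(cfg: ServeConfigSketch) -> float:
    """Convert the millisecond batching window to seconds (illustrative helper)."""
    return cfg.batch_wait_timeout_ms / 1000.0


# "example/gliner-model" is a placeholder identifier, not a real checkpoint.
base = ServeConfigSketch(model="example/gliner-model")

# Override only the fields that differ from the defaults.
small = ServeConfigSketch(
    model="example/gliner-model",
    device="cpu",
    num_replicas=2,
    max_batch_size=8,
)

# Each instance receives its own list from the factory, so mutating one
# instance's list leaves the other untouched.
base.precompiled_batch_sizes.append(16)

print(base.device, small.device)
print(base.precompiled_batch_sizes, small.precompiled_batch_sizes)
print(batch_wait_timeout_s(base))
```

With the real class, the equivalent call would presumably be `GLiNERServeConfig(model=...)` plus keyword overrides. The unit conversion is shown because Ray Serve's `serve.batch` decorator takes its wait window in seconds (`batch_wait_timeout_s`), while this config stores milliseconds; whether the deployment converts it in exactly this way is not stated in the source.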