gliner.serve.memory module¶

Memory estimation for GLiNER via precomputed calibration table.

Startup calibration runs the model on probe batches at power-of-two sequence lengths and records peak GPU memory per sample. At request time, batch_size_fn picks the largest precompiled batch size N that satisfies

per_sample(seq_len) * safety_factor * N <= (total_gpu - cuda_context - model_weights) * target_memory_fraction

using a pessimistic (rounded-up) seq_len, where safety_factor and target_memory_fraction are the values configured at construction. Labels and relations are NOT scaled as a separate dimension: they are part of the model input, so callers must include their word count in seq_len when invoking batch_size_fn.
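
A minimal sketch of that selection rule in plain Python; the byte figures are hypothetical, and pick_batch_size stands in for the real batch_size_fn:

    # Hypothetical figures for a 16 GiB card; the real values come from
    # measure_cuda_context() and measure_model_memory().
    total_gpu     = 16 * 1024**3
    cuda_context  = 600 * 1024**2
    model_weights = 1200 * 1024**2
    safety_factor = 1.3
    target_memory_fraction = 0.9

    budget = (total_gpu - cuda_context - model_weights) * target_memory_fraction

    def pick_batch_size(per_sample_bytes, precompiled_sizes):
        # Largest precompiled N whose scaled footprint fits the budget.
        fitting = [n for n in precompiled_sizes
                   if per_sample_bytes * safety_factor * n <= budget]
        # Falling back to the smallest size when nothing fits is an
        # assumption, not documented behaviour.
        return max(fitting) if fitting else min(precompiled_sizes)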

class gliner.serve.memory.GLiNERMemoryEstimator(safety_factor=1.3, target_memory_fraction=0.9, calibration_probe_batch_size=2)[source]¶

Bases: object

Precomputed memory table for GLiNER inference.

__init__(safety_factor=1.3, target_memory_fraction=0.9, calibration_probe_batch_size=2)[source]¶
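
For example, a configuration more conservative than the defaults (the values are illustrative):

    from gliner.serve.memory import GLiNERMemoryEstimator

    estimator = GLiNERMemoryEstimator(
        safety_factor=1.5,            # extra headroom on per-sample memory
        target_memory_fraction=0.8,   # use at most 80% of the post-load budget
        calibration_probe_batch_size=2,
    )
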
measure_cuda_context()[source]¶

Record CUDA context overhead. Must be called before the model loads.

measure_model_memory()[source]¶

Record model weight memory. Must be called after the model loads.

available_memory()[source]¶

Budget for a batch: total_gpu - cuda_context - model_weights.
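
The ordering of the two measurement calls matters. A sketch of the intended startup sequence, assuming the standard GLiNER.from_pretrained loader (the checkpoint name is illustrative):

    from gliner import GLiNER
    from gliner.serve.memory import GLiNERMemoryEstimator

    estimator = GLiNERMemoryEstimator()

    estimator.measure_cuda_context()           # before any weights hit the GPU
    model = GLiNER.from_pretrained("urchade/gliner_small-v2.1").to("cuda")
    estimator.measure_model_memory()           # after the weights are resident

    free_bytes = estimator.available_memory()  # total_gpu - cuda_context - model_weights
    print(f"batch budget: {free_bytes / 1024**2:.0f} MiB")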

calibrate(batch_method, max_seq_len, min_seq_len=64)[source]¶

Populate per_sample_table across power-of-two seq lengths.

Uses a single dummy label so the probed sequence length is dominated by text tokens; label/relation words are accounted for at lookup time by the caller extending seq_len.
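
A sketch of the calibration call at startup; run_batch is a stand-in for the server's batched inference callable, whose exact signature this page does not specify:

    # run_batch is assumed to be whatever callable the server uses to push a
    # padded probe batch through the model.
    estimator.calibrate(batch_method=run_batch, max_seq_len=2048, min_seq_len=64)
    # per_sample_table now holds one peak-bytes-per-sample entry for each
    # probed power of two between min_seq_len and max_seq_len.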

per_sample_at(seq_len)[source]¶

Pessimistic per-sample memory: returns the table entry at the smallest calibrated sequence length at or above seq_len.
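
For instance, with the power-of-two table above, a 300-word request is charged at the next calibrated length:

    # 300 falls between the 256 and 512 rows, so the pessimistic lookup
    # returns the per-sample figure measured at seq_len=512.
    bytes_per_sample = estimator.per_sample_at(300)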

batch_size_fn(seq_len, precompiled_sizes)[source]¶

Largest precompiled batch size satisfying per_sample * N <= budget.

Budget = (total_gpu - cuda_context - model_weights) * target_memory_fraction. The caller is responsible for folding label / relation word counts into seq_len.
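
A usage sketch that folds label words into seq_len as required; the counts and precompiled sizes are illustrative:

    text_words  = 180   # words in the request text
    label_words = 24    # words across all labels and relation prompts

    n = estimator.batch_size_fn(
        seq_len=text_words + label_words,        # caller folds label words in
        precompiled_sizes=[1, 2, 4, 8, 16, 32],  # illustrative compiled shapes
    )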