gliner.serve.memory module
Memory estimation for GLiNER via a precomputed calibration table.
Startup calibration runs the model on probe batches at power-of-two sequence
lengths and records peak GPU memory per sample. At request time, batch_size_fn
picks the largest precompiled batch size that satisfies
per_sample(seq_len) * N <= total_gpu - cuda_context - model_weights
using a pessimistic (rounded-up) seq_len and a safety factor on per-sample
memory. Labels and relations are NOT scaled as a separate dimension; they are
part of the model input, so callers must include their word count in
seq_len when invoking batch_size_fn.
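A caller-side sketch of the seq_len extension this note requires, assuming
estimator is an already-calibrated GLiNERMemoryEstimator; the word counts and
the size ladder are illustrative, not values from this module:

    # Fold label and relation word counts into seq_len before asking
    # for a batch size; the estimator does not scale them separately.
    text_words = 480
    label_words = 12        # e.g. words across all entity labels
    relation_words = 8      # zero if the request has no relations
    effective_seq_len = text_words + label_words + relation_words

    # estimator: a calibrated GLiNERMemoryEstimator instance
    n = estimator.batch_size_fn(effective_seq_len, precompiled_sizes=[1, 2, 4, 8, 16])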
-
class gliner.serve.memory.GLiNERMemoryEstimator(safety_factor=1.3, target_memory_fraction=0.9, calibration_probe_batch_size=2)[source]
Bases: object
Precomputed memory table for GLiNER inference.
-
__init__(safety_factor=1.3, target_memory_fraction=0.9, calibration_probe_batch_size=2)[source]
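A hypothetical end-to-end flow; the call ordering follows the method
docstrings below, while load_gliner_model and the batch_predict method name
are placeholders, not this module's API:

    est = GLiNERMemoryEstimator(safety_factor=1.3, target_memory_fraction=0.9)
    est.measure_cuda_context()      # before any model touches the GPU
    model = load_gliner_model()     # placeholder for the real loader
    est.measure_model_memory()      # after the weights are resident
    est.calibrate(model.batch_predict, max_seq_len=2048)  # batch_predict is hypothetical
    n = est.batch_size_fn(seq_len=512, precompiled_sizes=[1, 2, 4, 8, 16])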
-
measure_cuda_context()[source]
Record CUDA context overhead. Must be called before the model loads.
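One plausible implementation under a torch backend; the attribute name
_cuda_context_bytes is illustrative, and mem_get_info also counts memory held
by other processes on the device:

    import torch

    def measure_cuda_context(self):
        torch.cuda.init()                        # force context creation
        torch.cuda.synchronize()
        free, total = torch.cuda.mem_get_info()  # bytes free / total on device
        self._cuda_context_bytes = total - free  # nothing of ours allocated yet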
-
measure_model_memory()[source]
Record model weight memory. Must be called after the model loads.
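A matching sketch, again assuming torch; since this runs right after loading,
the allocator's running total approximates weight memory. The attribute name
is illustrative:

    import torch

    def measure_model_memory(self):
        torch.cuda.synchronize()
        self._model_bytes = torch.cuda.memory_allocated()  # weights + buffers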
-
available_memory()[source]
Budget for a batch: total_gpu - cuda_context - model_weights.
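A worked example with made-up numbers, in bytes:

    total_gpu = 24 * 2**30              # 24 GiB card
    cuda_context = int(0.6 * 2**30)     # measured before model load
    model_weights = int(1.8 * 2**30)    # measured after model load
    budget = total_gpu - cuda_context - model_weights  # ~21.6 GiB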
-
calibrate(batch_method, max_seq_len, min_seq_len=64)[source]
Populate per_sample_table across power-of-two seq lengths.
Uses a single dummy label so the probed sequence length is dominated
by text tokens; label and relation words are accounted for at lookup
time by the caller, who extends seq_len accordingly.
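A sketch of this loop, assuming a torch backend; make_probe_batch is a
hypothetical helper standing in for the library's probe construction
(synthetic text plus the single dummy label):

    import torch

    def calibrate(self, batch_method, max_seq_len, min_seq_len=64):
        self.per_sample_table = {}
        length = min_seq_len
        while length <= max_seq_len:
            torch.cuda.reset_peak_memory_stats()
            batch_method(make_probe_batch(length, self.calibration_probe_batch_size))
            torch.cuda.synchronize()
            # Peak includes the resident weights; subtract them to keep
            # only the per-batch activation cost.
            peak = torch.cuda.max_memory_allocated() - self._model_bytes
            self.per_sample_table[length] = peak / self.calibration_probe_batch_size
            length *= 2                  # power-of-two ladder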
-
per_sample_at(seq_len)[source]
Pessimistic per-sample memory, looked up at the smallest calibrated length at or above seq_len.
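A lookup sketch consistent with the rounded-up behavior; the table layout and
the clamping past the largest calibrated length are assumptions:

    def per_sample_at(self, seq_len):
        lengths = sorted(self.per_sample_table)
        # Snap up to the next calibrated power-of-two; past the table's
        # end, a real implementation might raise instead of clamping.
        key = next((cal_len for cal_len in lengths if cal_len >= seq_len), lengths[-1])
        return self.per_sample_table[key] * self.safety_factor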
-
batch_size_fn(seq_len, precompiled_sizes)[source]
Largest precompiled batch size N satisfying per_sample(seq_len) * N <= budget.
Budget = (total_gpu - cuda_context - model_weights) scaled by the
configured target_memory_fraction. The caller is responsible for
folding label/relation word counts into seq_len.
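A selection sketch under the assumptions above; the fallback to the smallest
precompiled size when nothing fits is illustrative, not documented behavior:

    def batch_size_fn(self, seq_len, precompiled_sizes):
        budget = self.available_memory() * self.target_memory_fraction
        per_sample = self.per_sample_at(seq_len)  # already safety-factored
        fits = [n for n in precompiled_sizes if per_sample * n <= budget]
        return max(fits, default=min(precompiled_sizes))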