gliner.infer_packing module¶
Utilities for inference-time sequence packing.
This module provides helpers to group many short sequences into a single (or a few) contiguous token streams in order to reduce the amount of padding the encoder needs to process. Packed batches keep a block-diagonal attention mask so tokens from different original sequences cannot attend to each other. After the encoder forward pass, results can be unpacked back to the original request ordering.
- class gliner.infer_packing.InferencePackingConfig(max_length, sep_token_id=None, streams_per_batch=1)[source]¶
Bases:
object
Configuration describing how sequences should be packed.
- max_length¶
Maximum number of tokens allowed in a packed stream.
- Type:
int
- sep_token_id¶
Optional separator token ID to insert between sequences. Currently not used in the implementation.
- Type:
int | None
- streams_per_batch¶
Number of streams to create per batch. Must be >= 1.
- Type:
int
- max_length: int¶
- sep_token_id: int | None = None¶
- streams_per_batch: int = 1¶
- __init__(max_length, sep_token_id=None, streams_per_batch=1)¶
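A minimal sketch of an equivalent configuration dataclass. The fields and defaults match the documented signature; the `__post_init__` validation is an assumption added to illustrate the documented constraint that `streams_per_batch` must be >= 1 (the real class may defer validation to `pack_requests`).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferencePackingConfig:
    """Configuration describing how sequences should be packed."""
    max_length: int                      # token budget per packed stream
    sep_token_id: Optional[int] = None   # reserved; currently unused by the packer
    streams_per_batch: int = 1           # number of streams per batch, must be >= 1

    def __post_init__(self):
        # Assumed validation: reject configurations the packer cannot honour.
        if self.max_length < 1:
            raise ValueError("max_length must be >= 1")
        if self.streams_per_batch < 1:
            raise ValueError("streams_per_batch must be >= 1")
```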
- class gliner.infer_packing.PackedBatch(input_ids, attention_mask, pair_attention_mask, segment_ids, map_out, offsets, lengths)[source]¶
Bases:
object
Container describing a packed collection of requests.
- input_ids¶
Tensor of shape (num_streams, max_len) containing packed token IDs.
- Type:
torch.LongTensor
- attention_mask¶
Tensor of shape (num_streams, max_len) with 1s for valid tokens and 0s for padding.
- Type:
torch.LongTensor
- pair_attention_mask¶
Boolean tensor of shape (num_streams, max_len, max_len) representing block-diagonal attention mask.
- Type:
torch.BoolTensor
- segment_ids¶
Tensor of shape (num_streams, max_len) with unique IDs for each packed segment within a stream.
- Type:
torch.LongTensor
- map_out¶
List of lists mapping each segment in each stream back to its original request index.
- Type:
List[List[int]]
- offsets¶
List of lists containing the starting offset of each segment within each stream.
- Type:
List[List[int]]
- lengths¶
List of lists containing the length of each segment within each stream.
- Type:
List[List[int]]
- input_ids: LongTensor¶
- attention_mask: LongTensor¶
- pair_attention_mask: BoolTensor¶
- segment_ids: LongTensor¶
- map_out: List[List[int]]¶
- offsets: List[List[int]]¶
- lengths: List[List[int]]¶
- __init__(input_ids, attention_mask, pair_attention_mask, segment_ids, map_out, offsets, lengths)¶
- gliner.infer_packing.block_diag_mask(segment_ids)[source]¶
Construct a block diagonal mask from per-token segment ids.
Creates a boolean attention mask where tokens can only attend to other tokens with the same segment ID. This prevents cross-contamination between different sequences packed into the same stream.
- Parameters:
segment_ids (LongTensor) – Tensor of shape (batch_size, seq_len) containing segment IDs for each token position.
- Returns:
Boolean tensor of shape (batch_size, seq_len, seq_len) where mask[b, i, j] is True if tokens i and j belong to the same segment in batch b.
- Return type:
BoolTensor
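The block-diagonal mask reduces to a single broadcast equality comparison. A sketch of the same logic in NumPy (the real function operates on `torch.LongTensor` inputs; NumPy is used here only to keep the example dependency-light):

```python
import numpy as np

def block_diag_mask(segment_ids: np.ndarray) -> np.ndarray:
    """Boolean mask where mask[b, i, j] is True iff tokens i and j
    share a segment id in batch b (a block-diagonal pattern)."""
    # Broadcast (B, L, 1) against (B, 1, L) into a (B, L, L) comparison.
    return segment_ids[:, :, None] == segment_ids[:, None, :]

# Two segments of lengths 2 and 3 packed into one stream.
seg = np.array([[0, 0, 1, 1, 1]])
mask = block_diag_mask(seg)
```

Tokens 0 and 1 (segment 0) can attend to each other, but not to tokens 2-4 (segment 1), so no information leaks between packed sequences.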
- gliner.infer_packing.pack_requests(requests, cfg, pad_token_id)[source]¶
Pack a collection of requests into one or more streams.
Groups multiple short sequences into contiguous token streams to reduce padding overhead. Each request’s tokens are placed into streams using a first-fit strategy. A block-diagonal attention mask ensures tokens from different requests cannot attend to each other.
- Parameters:
requests (List[Dict[str, Any]]) – List of request dictionaries. Each must contain an ‘input_ids’ key with a sequence of token IDs.
cfg (InferencePackingConfig) – Configuration specifying packing parameters (max_length, etc.).
pad_token_id (int) – Token ID to use for padding positions.
- Returns:
PackedBatch object containing packed tensors and metadata needed to unpack results back to original request ordering.
- Raises:
ValueError – If requests list is empty or configuration is invalid.
KeyError – If any request is missing required ‘input_ids’ key.
- Return type:
PackedBatch
Example
>>> requests = [
...     {"input_ids": [1, 2, 3]},
...     {"input_ids": [4, 5]},
... ]
>>> cfg = InferencePackingConfig(max_length=10)
>>> batch = pack_requests(requests, cfg, pad_token_id=0)
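The first-fit strategy mentioned above can be sketched in a few lines of pure Python. The helper name `pack_first_fit` and its return shape (lists of request indices per stream) are illustrative assumptions, not the library's API:

```python
def pack_first_fit(lengths, max_length):
    """Assign each request (by index) to the first stream with enough room;
    open a new stream when none fits. Returns a list of streams, each a
    list of original request indices."""
    streams = []  # streams[s] = request indices packed into stream s
    free = []     # free[s] = remaining token budget of stream s
    for idx, n in enumerate(lengths):
        for s, remaining in enumerate(free):
            if n <= remaining:        # first stream with room wins
                streams[s].append(idx)
                free[s] -= n
                break
        else:                         # no existing stream fits: open a new one
            streams.append([idx])
            free.append(max_length - n)
    return streams

assignment = pack_first_fit([3, 2, 6, 4], max_length=8)
```

With a budget of 8 tokens per stream, requests of lengths 3 and 2 share the first stream, while the requests of lengths 6 and 4 each get their own.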
- gliner.infer_packing.unpack_spans(per_token_outputs, packed)[source]¶
Unpack encoder outputs back to the original request layout.
Takes per-token outputs from a packed batch and redistributes them back to match the original request ordering. Handles requests that were split across multiple streams by concatenating their segments.
- Parameters:
per_token_outputs (Any) – Tensor or array of shape (num_streams, max_len, …) containing per-token outputs from the encoder.
packed (PackedBatch) – PackedBatch object containing metadata about how requests were packed (from pack_requests).
- Returns:
List of tensors or arrays (one per original request) containing the unpacked outputs. If input was a NumPy array, outputs will be NumPy arrays; if PyTorch tensor, outputs will be PyTorch tensors.
- Raises:
ValueError – If per_token_outputs is not at least 2-dimensional.
TypeError – If per_token_outputs is neither a PyTorch tensor nor NumPy array.
- Return type:
List[Any]
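The unpacking step is plain bookkeeping over the `map_out`, `offsets`, and `lengths` metadata described above. A NumPy sketch (the real function accepts the `PackedBatch` itself and also handles torch tensors; passing the three lists separately here is purely for illustration):

```python
import numpy as np

def unpack_spans(per_token_outputs, map_out, offsets, lengths):
    """Redistribute (num_streams, max_len, ...) per-token outputs back to
    one array per original request, re-joining requests split across streams."""
    n_requests = max(i for stream in map_out for i in stream) + 1
    pieces = [[] for _ in range(n_requests)]
    for s, stream in enumerate(map_out):
        for seg, req in enumerate(stream):
            start = offsets[s][seg]
            stop = start + lengths[s][seg]
            pieces[req].append(per_token_outputs[s, start:stop])
    # Concatenate along the token axis in case a request spans several streams.
    return [np.concatenate(p, axis=0) for p in pieces]

# One stream of 5 tokens holding request 0 (3 tokens) then request 1 (2 tokens).
outputs = np.arange(5, dtype=float).reshape(1, 5, 1)
unpacked = unpack_spans(outputs, map_out=[[0, 1]], offsets=[[0, 3]], lengths=[[3, 2]])
```

Each returned array has the original request's token count as its leading dimension, with trailing feature dimensions preserved.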