gliner.infer_packing module¶

Utilities for inference-time sequence packing.

This module provides helpers to group many short sequences into a single (or a few) contiguous token streams in order to reduce the amount of padding the encoder needs to process. Packed batches keep a block-diagonal attention mask so tokens from different original sequences cannot attend to each other. After the encoder forward pass, results can be unpacked back to the original request ordering.

class gliner.infer_packing.InferencePackingConfig(max_length, sep_token_id=None, streams_per_batch=1)[source]¶

Bases: object

Configuration describing how sequences should be packed.

max_length¶

Maximum number of tokens allowed in a packed stream.

Type:

int

sep_token_id¶

Optional separator token ID to insert between sequences. Currently not used in the implementation.

Type:

int | None

streams_per_batch¶

Number of streams to create per batch. Must be >= 1.

Type:

int

max_length: int¶
sep_token_id: int | None = None¶
streams_per_batch: int = 1¶
__init__(max_length, sep_token_id=None, streams_per_batch=1)¶
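The fields above can be mirrored in a small standalone dataclass. This is a hedged sketch only: `PackingConfigSketch` is a stand-in for illustration, not the library class, and it adds the `streams_per_batch >= 1` validation the docs state as a requirement (whether the real class validates in `__post_init__` is an assumption).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PackingConfigSketch:
    # Stand-in mirroring the documented fields of InferencePackingConfig;
    # the real class lives in gliner.infer_packing.
    max_length: int
    sep_token_id: Optional[int] = None
    streams_per_batch: int = 1

    def __post_init__(self):
        # The docs require streams_per_batch >= 1.
        if self.streams_per_batch < 1:
            raise ValueError("streams_per_batch must be >= 1")

cfg = PackingConfigSketch(max_length=512)
```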
class gliner.infer_packing.PackedBatch(input_ids, attention_mask, pair_attention_mask, segment_ids, map_out, offsets, lengths)[source]¶

Bases: object

Container describing a packed collection of requests.

input_ids¶

Tensor of shape (num_streams, max_len) containing packed token IDs.

Type:

torch.LongTensor

attention_mask¶

Tensor of shape (num_streams, max_len) with 1s for valid tokens and 0s for padding.

Type:

torch.LongTensor

pair_attention_mask¶

Boolean tensor of shape (num_streams, max_len, max_len) representing block-diagonal attention mask.

Type:

torch.BoolTensor

segment_ids¶

Tensor of shape (num_streams, max_len) with unique IDs for each packed segment within a stream.

Type:

torch.LongTensor

map_out¶

List of lists mapping each segment in each stream back to its original request index.

Type:

List[List[int]]

offsets¶

List of lists containing the starting offset of each segment within each stream.

Type:

List[List[int]]

lengths¶

List of lists containing the length of each segment within each stream.

Type:

List[List[int]]

input_ids: LongTensor¶
attention_mask: LongTensor¶
pair_attention_mask: BoolTensor¶
segment_ids: LongTensor¶
map_out: List[List[int]]¶
offsets: List[List[int]]¶
lengths: List[List[int]]¶
__init__(input_ids, attention_mask, pair_attention_mask, segment_ids, map_out, offsets, lengths)¶
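To make the field shapes concrete, here is a hand-built illustration (not output produced by the library) of what the fields could look like for two requests, `[1, 2, 3]` and `[4, 5]`, packed into a single stream of `max_len` 6. The segment-ID convention shown (0 for padding, segments numbered from 1) is an assumption for illustration; NumPy arrays stand in for the documented torch tensors.

```python
import numpy as np

input_ids      = np.array([[1, 2, 3, 4, 5, 0]])  # packed tokens, padded to max_len=6
attention_mask = np.array([[1, 1, 1, 1, 1, 0]])  # 1 = real token, 0 = padding
segment_ids    = np.array([[1, 1, 1, 2, 2, 0]])  # assumed convention: 0 = padding
map_out = [[0, 1]]   # segment k of stream 0 came from request map_out[0][k]
offsets = [[0, 3]]   # start position of each segment within the stream
lengths = [[3, 2]]   # token count of each segment

# offsets/lengths recover each original request's tokens from the stream:
recovered = [input_ids[0, off:off + ln].tolist()
             for off, ln in zip(offsets[0], lengths[0])]
```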
gliner.infer_packing.block_diag_mask(segment_ids)[source]¶

Construct a block diagonal mask from per-token segment ids.

Creates a boolean attention mask where tokens can only attend to other tokens with the same segment ID. This prevents cross-contamination between different sequences packed into the same stream.

Parameters:

segment_ids (LongTensor) – Tensor of shape (batch_size, seq_len) containing segment IDs for each token position.

Returns:

Boolean tensor of shape (batch_size, seq_len, seq_len) where mask[b, i, j] is True if tokens i and j belong to the same segment in batch b.

Return type:

BoolTensor
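The documented behaviour fits in a few lines; this NumPy sketch (`block_diag_mask_np` is an illustrative stand-in for the torch implementation) compares segment IDs pairwise via broadcasting. Note that, as written, padding positions sharing an ID also attend to each other; whether the library additionally masks padding out is not stated here.

```python
import numpy as np

def block_diag_mask_np(segment_ids):
    # mask[b, i, j] is True when tokens i and j share a segment ID in batch b.
    return segment_ids[:, :, None] == segment_ids[:, None, :]

seg = np.array([[1, 1, 2, 2, 2]])  # two segments packed into one stream
mask = block_diag_mask_np(seg)
```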

gliner.infer_packing.pack_requests(requests, cfg, pad_token_id)[source]¶

Pack a collection of requests into one or more streams.

Groups multiple short sequences into contiguous token streams to reduce padding overhead. Each request’s tokens are placed into streams using a first-fit strategy. A block-diagonal attention mask ensures tokens from different requests cannot attend to each other.

Parameters:
  • requests (List[Dict[str, Any]]) – List of request dictionaries. Each must contain an ‘input_ids’ key with a sequence of token IDs.

  • cfg (InferencePackingConfig) – Configuration specifying packing parameters (max_length, etc.).

  • pad_token_id (int) – Token ID to use for padding positions.

Returns:

PackedBatch object containing packed tensors and metadata needed to unpack results back to original request ordering.

Raises:
  • ValueError – If the requests list is empty or the configuration is invalid.

  • KeyError – If any request is missing the required ‘input_ids’ key.

Return type:

PackedBatch

Example

>>> requests = [
...     {"input_ids": [1, 2, 3]},
...     {"input_ids": [4, 5]},
... ]
>>> cfg = InferencePackingConfig(max_length=10)
>>> batch = pack_requests(requests, cfg, pad_token_id=0)
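The first-fit placement mentioned above can be sketched in pure Python. `first_fit_streams` is an illustrative helper, not the library function, and it is simplified: since unpack_spans documents handling of requests split across streams, the real packer may additionally split a request that does not fit in the remaining room rather than opening a new stream.

```python
def first_fit_streams(lengths, max_length):
    """Place each request into the first stream with enough room,
    opening a new stream when none fits (first-fit, no splitting)."""
    streams = []     # tokens used so far in each stream
    placement = []   # (request_index, stream_index) pairs
    for idx, n in enumerate(lengths):
        for s, used in enumerate(streams):
            if used + n <= max_length:
                streams[s] += n
                placement.append((idx, s))
                break
        else:
            streams.append(n)
            placement.append((idx, len(streams) - 1))
    return placement, streams

placement, streams = first_fit_streams([3, 2, 6, 4], max_length=10)
```

With `max_length=10`, requests 0, 1, and 3 fit into stream 0 (3 + 2 + 4 = 9 tokens) while request 2 opens stream 1.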
gliner.infer_packing.unpack_spans(per_token_outputs, packed)[source]¶

Unpack encoder outputs back to the original request layout.

Takes per-token outputs from a packed batch and redistributes them back to match the original request ordering. Handles requests that were split across multiple streams by concatenating their segments.

Parameters:
  • per_token_outputs (Any) – Tensor or array of shape (num_streams, max_len, …) containing per-token outputs from the encoder.

  • packed (PackedBatch) – PackedBatch object containing metadata about how requests were packed (from pack_requests).

Returns:

List of tensors or arrays (one per original request) containing the unpacked outputs. If the input was a NumPy array, the outputs will be NumPy arrays; if it was a PyTorch tensor, they will be PyTorch tensors.

Raises:
  • ValueError – If per_token_outputs is not at least 2-dimensional.

  • TypeError – If per_token_outputs is neither a PyTorch tensor nor NumPy array.

Return type:

List[Any]
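The inverse operation described above can be sketched with the PackedBatch metadata fields (map_out, offsets, lengths). `unpack_spans_np` is an illustrative NumPy stand-in, not the library code; the example below includes a request split across two streams, whose segments are concatenated back together in request order.

```python
import numpy as np

def unpack_spans_np(per_token_outputs, map_out, offsets, lengths, num_requests):
    """Slice each segment out of its stream and concatenate the segments
    belonging to the same original request, in request order."""
    pieces = [[] for _ in range(num_requests)]
    for s, (reqs, offs, lens) in enumerate(zip(map_out, offsets, lengths)):
        for req, off, ln in zip(reqs, offs, lens):
            pieces[req].append(per_token_outputs[s, off:off + ln])
    return [np.concatenate(p, axis=0) for p in pieces]

per_token = np.arange(12).reshape(2, 6)  # 2 streams, max_len 6, scalar outputs
map_out = [[0, 1], [1]]                  # request 1 is split across both streams
offsets = [[0, 3], [0]]
lengths = [[3, 3], [2]]
result = unpack_spans_np(per_token, map_out, offsets, lengths, num_requests=2)
```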