gliner.data_processing.processor module¶

class gliner.data_processing.processor.BaseProcessor(config, tokenizer, words_splitter)[source]¶

Bases: ABC

Abstract base class for data processors.

This class provides the common interface and utilities for all processor implementations, handling tokenization, label preparation, and batch collation for NER and RE tasks.

__init__(config, tokenizer, words_splitter)[source]¶

Initialize the base processor.

Parameters:

config – Configuration object containing model and processing parameters.
tokenizer – Transformer tokenizer for subword tokenization.
words_splitter – Word-level tokenizer/splitter. If None, creates one based on config.words_splitter_type.

static get_dict(spans, classes_to_id)[source]¶

Create a dictionary mapping spans to their class IDs.

Parameters:

spans (List[Tuple[int, int, str]]) – List of tuples (start, end, label) representing entity spans.
classes_to_id (Dict[str, int]) – Mapping from class labels to integer IDs.

Returns:

Dictionary mapping (start, end) tuples to class IDs.

Return type:

Dict[Tuple[int, int], int]
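As an illustration of the mapping described above, here is a minimal sketch. It is not the library's code: the helper name `spans_to_dict` is hypothetical, and skipping labels that are absent from `classes_to_id` is an assumption.

```python
from typing import Dict, List, Tuple


def spans_to_dict(
    spans: List[Tuple[int, int, str]],
    classes_to_id: Dict[str, int],
) -> Dict[Tuple[int, int], int]:
    """Map each (start, end) span to the integer ID of its label."""
    span_dict: Dict[Tuple[int, int], int] = {}
    for start, end, label in spans:
        # Assumption: labels that are missing from the mapping are skipped.
        if label in classes_to_id:
            span_dict[(start, end)] = classes_to_id[label]
    return span_dict


spans = [(0, 1, "person"), (4, 4, "city")]
classes_to_id = {"person": 1, "city": 2}
result = spans_to_dict(spans, classes_to_id)
```

The span boundaries become the dictionary key, so two annotations on the same (start, end) pair would overwrite each other in this sketch.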

abstract preprocess_example(tokens, ner, classes_to_id)[source]¶

Preprocess a single example for model input.

Parameters:

tokens (List[str]) – List of token strings.
ner (List[Tuple[int, int, str]]) – List of NER annotations as (start, end, label) tuples.
classes_to_id (Dict[str, int]) – Mapping from class labels to integer IDs.

Returns:

Dictionary containing preprocessed example data.

Raises:

NotImplementedError – Must be implemented by subclasses.

Return type:

Dict

abstract create_labels()[source]¶

Create label tensors from batch data.

Returns:

Tensor containing labels for the batch.

Raises:

NotImplementedError – Must be implemented by subclasses.

Return type:

Tensor

abstract tokenize_and_prepare_labels()[source]¶

Tokenize inputs and prepare labels for a batch.

Raises:

NotImplementedError – Must be implemented by subclasses.

prepare_inputs(texts, entities, blank=None, add_entities=True, **kwargs)[source]¶

Prepare input texts with entity type prompts.

Prepends special entity-type tokens that aggregate entity label information.

Parameters:

texts (Sequence[Sequence[str]]) – Sequences of token strings, one per example.
entities (Sequence[Sequence[str]] | Dict[int, Sequence[str]] | Sequence[str]) – Entity types to extract. Can be a list of lists (per-example entity types), a dictionary (shared entity types), or a list of strings (same types for all examples).
blank (str | None) – Optional blank entity token for zero-shot scenarios.
add_entities (bool | None) – Whether to add entity text string to the prompt.
**kwargs – Additional keyword arguments.

Returns:

Tuple containing:

- List of input text sequences with prepended prompts
- List of prompt lengths for each example
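The prompt construction can be pictured with the following sketch. It is an illustration under stated assumptions, not the library's implementation: the function name `build_prompted_inputs`, the marker strings `<<ENT>>` and `<<SEP>>`, and the exact prompt layout are all assumed here, and only the case of one shared list of entity types is covered.

```python
from typing import List, Sequence, Tuple

ENT_TOKEN = "<<ENT>>"  # assumed special token placed before each entity type
SEP_TOKEN = "<<SEP>>"  # assumed separator between the prompt and the text


def build_prompted_inputs(
    texts: Sequence[Sequence[str]],
    entity_types: Sequence[str],
    add_entities: bool = True,
) -> Tuple[List[List[str]], List[int]]:
    """Prepend an entity-type prompt to every token sequence."""
    inputs: List[List[str]] = []
    prompt_lengths: List[int] = []
    for tokens in texts:
        prompt: List[str] = []
        for entity_type in entity_types:
            prompt.append(ENT_TOKEN)
            if add_entities:
                # The label text follows its marker only when requested.
                prompt.append(entity_type)
        prompt.append(SEP_TOKEN)
        prompt_lengths.append(len(prompt))
        inputs.append(prompt + list(tokens))
    return inputs, prompt_lengths


inputs, lengths = build_prompted_inputs(
    [["Ada", "lives", "in", "Paris"]], ["person", "city"]
)
```

The returned prompt lengths are what later steps need in order to skip the prompt words when aligning predictions with the original text.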

prepare_word_mask(texts, tokenized_inputs, skip_first_words=None, token_level=False)[source]¶

Prepare word-level masks for tokenized inputs.

Creates masks that map subword tokens back to their original words.

Parameters:

texts – Original text sequences.
tokenized_inputs – Tokenized inputs from transformer tokenizer.
skip_first_words – Optional list of word counts to skip per example (e.g., prompt words).
token_level – If True, create token-level masks instead of word-level.

Returns:

Word mask array.
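A rough sketch of how such a mask can be derived is shown below. It assumes a per-token list of word indices with `None` for special tokens, which is the shape that Hugging Face fast tokenizers return from `BatchEncoding.word_ids()`. The function name `build_word_mask` and the choice to mark only the first subword of each word with a 1-based word number are assumptions for illustration.

```python
from typing import List, Optional


def build_word_mask(
    word_ids: List[Optional[int]],
    skip_first_words: int = 0,
) -> List[int]:
    """Return one mask value per subword token.

    The first subword of each kept word gets the 1-based index of that word
    (counted after the skipped prompt words); every other position gets 0.
    """
    mask: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            mask.append(0)  # special token such as [CLS] or [SEP]
        elif word_id != previous and word_id >= skip_first_words:
            mask.append(word_id - skip_first_words + 1)
        else:
            mask.append(0)  # prompt word or continuation subword
        previous = word_id
    return mask


# Word 0 is a prompt word; word 1 is split into two subwords.
word_mask = build_word_mask([None, 0, 1, 1, 2, None], skip_first_words=1)
```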

tokenize_inputs(texts, entities, blank=None, **kwargs)[source]¶

Tokenize input texts with entity prompts.

Parameters:

texts – Sequences of token strings.
entities – Entity types for extraction.
blank – Optional blank entity token.
**kwargs – Additional keyword arguments.

Returns:

Dictionary containing tokenized inputs with keys:

- input_ids: Token IDs
- attention_mask: Attention mask
- words_mask: Word-level mask

batch_generate_class_mappings(batch_list, negatives=None, key='ner', sampled_neg=100)[source]¶

Generate class mappings for a batch with negative sampling.

Creates bidirectional mappings between class labels and integer IDs, with support for negative type sampling to improve model robustness.

Parameters:

batch_list (List[Dict]) – List of example dictionaries.
negatives (List[str] | None) – Optional pre-sampled negative types. If None, samples from batch.
key (str) – Key to access labels in batch dictionaries (default: 'ner').
sampled_neg (int) – Number of negative types to sample if negatives is None.

Returns:

Tuple containing:

- List of class-to-ID mappings (one per example)
- List of ID-to-class mappings (one per example)
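The idea of per-example mappings with sampled negatives can be sketched as follows. This is a simplified illustration rather than the library's algorithm: the function name, the `seed` parameter, the 1-based IDs, and the strategy of drawing negatives from the labels seen in the batch are all assumptions.

```python
import random
from typing import Dict, List, Optional, Tuple


def generate_class_mappings(
    batch_list: List[Dict],
    negatives: Optional[List[str]] = None,
    key: str = "ner",
    sampled_neg: int = 100,
    seed: int = 0,  # hypothetical parameter, added only for reproducibility
) -> Tuple[List[Dict[str, int]], List[Dict[int, str]]]:
    """Build one class-to-ID and one ID-to-class mapping per example."""
    rng = random.Random(seed)
    if negatives is None:
        # Assumption: negatives are sampled from every label seen in the batch.
        pool = sorted({label for ex in batch_list for _, _, label in ex[key]})
        negatives = rng.sample(pool, k=min(sampled_neg, len(pool)))

    class_to_ids: List[Dict[str, int]] = []
    id_to_classes: List[Dict[int, str]] = []
    for example in batch_list:
        positives = [label for _, _, label in example[key]]
        # Deduplicate while keeping order: positives first, then negatives.
        types = list(dict.fromkeys(positives + list(negatives)))
        # Assumption: IDs start at 1 so that 0 can mean "no entity".
        mapping = {label: idx for idx, label in enumerate(types, start=1)}
        class_to_ids.append(mapping)
        id_to_classes.append({idx: label for label, idx in mapping.items()})
    return class_to_ids, id_to_classes


batch = [{"ner": [(0, 0, "person")]}, {"ner": [(1, 2, "city")]}]
c2i, i2c = generate_class_mappings(batch, negatives=["organization"])
```

Each example sees its own gold labels plus types it does not contain, which is what exposes the model to negative types during training.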

collate_raw_batch(batch_list, entity_types=None, negatives=None, class_to_ids=None, id_to_classes=None, key='ner')[source]¶

Collate a raw batch with optional dynamic or provided label mappings.

Parameters:

batch_list (List[Dict]) – List of raw example dictionaries.
entity_types (List[str | List[str]] | None) – Optional predefined entity types. Can be a single list for all examples or list of lists for per-example types.
negatives (List[str] | None) – Optional list of negative entity types.
class_to_ids (Dict[str, int] | List[Dict[str, int]] | None) – Optional predefined class-to-ID mapping(s).
id_to_classes (Dict[int, str] | List[Dict[int, str]] | None) – Optional predefined ID-to-class mapping(s).
key – Key for accessing labels in batch (default: 'ner').

Returns:

Dictionary containing collated batch data ready for model input.

Return type:

Dict
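When `entity_types` is supplied, it may be either one list shared by all examples or one list per example. A minimal sketch of normalizing both forms into per-example mappings is given below; the helper name `mappings_from_entity_types` and the 1-based IDs are assumptions, and the real method does considerably more (preprocessing and batching).

```python
from typing import Dict, List, Sequence, Tuple, Union


def mappings_from_entity_types(
    entity_types: Sequence[Union[str, Sequence[str]]],
    batch_size: int,
) -> Tuple[List[Dict[str, int]], List[Dict[int, str]]]:
    """Turn shared or per-example entity types into per-example mappings."""
    if entity_types and isinstance(entity_types[0], str):
        # A flat list of strings is shared by every example in the batch.
        per_example = [list(entity_types) for _ in range(batch_size)]
    else:
        # A list of lists already holds one entry per example.
        per_example = [list(types) for types in entity_types]

    # Assumption: IDs start at 1, leaving 0 free for "no entity".
    class_to_ids = [
        {label: idx for idx, label in enumerate(types, start=1)}
        for types in per_example
    ]
    id_to_classes = [
        {idx: label for label, idx in mapping.items()} for mapping in class_to_ids
    ]
    return class_to_ids, id_to_classes


shared_c2i, shared_i2c = mappings_from_entity_types(["person", "city"], batch_size=2)
per_ex_c2i, _ = mappings_from_entity_types([["person"], ["city", "date"]], batch_size=2)
```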

collate_fn(batch, prepare_labels=True, *args, **kwargs)[source]¶

Collate function for DataLoader.

Parameters:

batch – Batch of examples from dataset.
prepare_labels – Whether to prepare labels (default: True).
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.

Returns:

Dictionary containing model inputs and labels.

abstract create_batch_dict(batch, class_to_ids, id_to_classes)[source]¶

Create a batch dictionary from preprocessed examples.

Parameters:

batch (List[Dict]) – List of preprocessed example dictionaries.
class_to_ids (List[Dict[str, int]]) – List of class-to-ID mappings.
id_to_classes (List[Dict[int, str]]) – List of ID-to-class mappings.

Returns:

Dictionary containing collated batch tensors.

Raises:

NotImplementedError – Must be implemented by subclasses.

Return type:

Dict

create_dataloader(data, entity_types=None, *args, **kwargs)[source]¶

Create a PyTorch DataLoader with the processor’s collate function.

Parameters:

data – Dataset to load.
entity_types – Optional entity types for extraction.
*args – Additional positional arguments for DataLoader.
**kwargs – Additional keyword arguments for DataLoader.

Returns:

DataLoader instance configured with this processor’s collate_fn.

Return type:

DataLoader

class gliner.data_processing.processor.UniEncoderSpanProcessor(config, tokenizer, words_splitter)[source]¶

Bases: BaseProcessor

Processor for span-based NER with uni-encoder architecture.

This processor handles span enumeration and labeling for models that predict entity types for all possible spans up to a maximum width.

Initialize the base processor.

Parameters:

config – Configuration object containing model and processing parameters.
tokenizer – Transformer tokenizer for subword tokenization.
words_splitter – Word-level tokenizer/splitter. If None, creates one based on config.words_splitter_type.

preprocess_example(tokens, ner, classes_to_id)[source]¶

Preprocess a single example for span-based prediction.

Enumerates all possible spans up to max_width and creates labels for each span based on NER annotations.

Parameters:

tokens – List of token strings.
ner – List of NER annotations as (start, end, label) tuples.
classes_to_id – Mapping from class labels to integer IDs.

Returns:

Dictionary containing:

- tokens: Token strings
- span_idx: Tensor of span indices (start, end)
- span_label: Tensor of span labels
- seq_length: Sequence length
- entities: Original NER annotations

Warns:

UserWarning – If the sequence length exceeds max_len, in which case the sequence is truncated.
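Span enumeration up to a maximum width can be sketched in a few lines. The version below is illustrative only: the function names are hypothetical, end indices are assumed to be inclusive, label 0 is assumed to mean "no entity", and marking spans that run past the end of the sequence with -1 is an assumption about how invalid spans might be flagged.

```python
from typing import Dict, List, Tuple


def enumerate_spans(num_tokens: int, max_width: int) -> List[Tuple[int, int]]:
    """List (start, end) pairs for every start position and width < max_width."""
    return [
        (start, start + width)
        for start in range(num_tokens)
        for width in range(max_width)
    ]


def label_spans(
    span_idx: List[Tuple[int, int]],
    ner: List[Tuple[int, int, str]],
    classes_to_id: Dict[str, int],
    num_tokens: int,
) -> List[int]:
    """Assign a class ID to each enumerated span."""
    gold = {
        (start, end): classes_to_id[label]
        for start, end, label in ner
        if label in classes_to_id
    }
    labels = []
    for start, end in span_idx:
        if end >= num_tokens:
            labels.append(-1)  # assumed marker for spans past the sequence end
        else:
            labels.append(gold.get((start, end), 0))  # 0 = no entity (assumed)
    return labels


span_idx = enumerate_spans(num_tokens=3, max_width=2)
span_label = label_spans(span_idx, [(0, 1, "person")], {"person": 1}, num_tokens=3)
```

Enumerating a fixed number of spans per start position keeps the span list rectangular, which makes later padding and batching straightforward.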

create_batch_dict(batch, class_to_ids, id_to_classes)[source]¶

Create a batch dictionary from preprocessed span examples.

Parameters:

batch – List of preprocessed example dictionaries.
class_to_ids – List of class-to-ID mappings.
id_to_classes – List of ID-to-class mappings.

Returns:

Dictionary containing:

- seq_length: Sequence lengths
- span_idx: Padded span indices
- tokens: Token strings
- span_mask: Mask for valid spans
- span_label: Padded span labels
- entities: Original NER annotations
- classes_to_id: Class mappings
- id_to_classes: Reverse class mappings

create_labels(batch)[source]¶

Create one-hot encoded labels for spans.

Creates multi-label one-hot vectors for each span, allowing spans to have multiple entity types.

Parameters:

batch – Batch dictionary containing tokens, entities, and class mappings.

Returns:

Tensor of shape (batch_size, max_spans, num_classes) containing one-hot encoded labels.
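The multi-label encoding can be illustrated with plain nested lists standing in for tensors. This sketch assumes that each span carries a list of 1-based class IDs (empty for no entity) and that shorter examples are padded with all-zero rows; the function name is hypothetical.

```python
from typing import List


def multi_hot_span_labels(
    span_class_ids: List[List[List[int]]],
    num_classes: int,
) -> List[List[List[float]]]:
    """Encode per-span class IDs as vectors of shape (batch, max_spans, classes)."""
    max_spans = max(len(example) for example in span_class_ids)
    batch = []
    for example in span_class_ids:
        rows = []
        for class_ids in example:
            row = [0.0] * num_classes
            for class_id in class_ids:
                row[class_id - 1] = 1.0  # IDs are assumed to be 1-based
            rows.append(row)
        # Pad examples with fewer spans using all-zero rows.
        rows.extend([0.0] * num_classes for _ in range(max_spans - len(rows)))
        batch.append(rows)
    return batch


# Example 0 has three spans (the last one has two types); example 1 has one span.
labels = multi_hot_span_labels([[[1], [], [1, 2]], [[2]]], num_classes=2)
```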

tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)[source]¶

Tokenize inputs and prepare span labels for a batch.

Parameters:

batch – Batch dictionary with tokens and class mappings.
prepare_labels – Whether to prepare labels.
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.

Returns:

Dictionary containing tokenized inputs and optionally labels.

class gliner.data_processing.processor.UniEncoderTokenProcessor(config, tokenizer, words_splitter)[source]¶

Bases: BaseProcessor

Processor for token-based NER with uni-encoder architecture.

This processor handles token-level classification where each token is labeled with BIO-style tags (Begin, Inside, Outside) for each entity type.

Initialize the base processor.

Parameters:

config – Configuration object containing model and processing parameters.
tokenizer – Transformer tokenizer for subword tokenization.
words_splitter – Word-level tokenizer/splitter. If None, creates one based on config.words_splitter_type.

preprocess_example(tokens, ner, classes_to_id)[source]¶

Preprocess a single example for token-based prediction.

Parameters:

tokens – List of token strings.
ner – List of NER annotations as (start, end, label) tuples.
classes_to_id – Mapping from class labels to integer IDs.

Returns:

Dictionary containing:

- tokens: Token strings
- seq_length: Sequence length
- entities: Original NER annotations
- entities_id: Entity annotations with class IDs

Warns:

UserWarning – If the sequence length exceeds max_len, in which case the sequence is truncated.

create_batch_dict(batch, class_to_ids, id_to_classes)[source]¶

Create a batch dictionary from preprocessed token examples.

Parameters:

batch – List of preprocessed example dictionaries.
class_to_ids – List of class-to-ID mappings.
id_to_classes – List of ID-to-class mappings.

Returns:

Dictionary containing:

- tokens: Token strings
- seq_length: Sequence lengths
- entities: Original NER annotations
- entities_id: Entity annotations with class IDs
- classes_to_id: Class mappings
- id_to_classes: Reverse class mappings

create_labels(entities_id, batch_size, seq_len, num_classes)[source]¶

Create token-level labels with begin/inside/end markers.

Creates labels indicating which tokens are at the start, end, or inside of entity spans for each entity type.

Parameters:

entities_id – List of entity annotations with class IDs for each example.
batch_size – Size of the batch.
seq_len – Maximum sequence length in batch.
num_classes – Number of entity classes.

Returns:

Tensor of shape (batch_size, seq_len, num_classes, 3) where the last dimension contains [start_marker, end_marker, inside_marker].
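To make the label layout concrete, the sketch below fills a nested list of shape (batch_size, seq_len, num_classes, 3) in the documented marker order. It is an approximation for illustration: the function name is hypothetical, entities are assumed to be (start, end, class_id) tuples with inclusive ends and 1-based class IDs, and setting the inside marker on the start and end tokens as well is an assumption.

```python
from typing import List, Tuple


def token_level_labels(
    entities_id: List[List[Tuple[int, int, int]]],
    batch_size: int,
    seq_len: int,
    num_classes: int,
) -> List[List[List[List[int]]]]:
    """Build [start_marker, end_marker, inside_marker] labels per token and class."""
    labels = [
        [[[0, 0, 0] for _ in range(num_classes)] for _ in range(seq_len)]
        for _ in range(batch_size)
    ]
    for batch_index, entities in enumerate(entities_id):
        for start, end, class_id in entities:
            if start >= seq_len or end >= seq_len:
                continue  # skip entities that fall outside the sequence
            cls = class_id - 1  # class IDs are assumed to be 1-based
            labels[batch_index][start][cls][0] = 1  # start marker
            labels[batch_index][end][cls][1] = 1  # end marker
            for position in range(start, end + 1):
                labels[batch_index][position][cls][2] = 1  # inside marker
    return labels


# One example, four tokens, two classes; tokens 1..2 form an entity of class 1.
token_labels = token_level_labels([[(1, 2, 1)]], batch_size=1, seq_len=4, num_classes=2)
```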

tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)[source]¶

Tokenize inputs and prepare token-level labels for a batch.

Parameters:

batch – Batch dictionary with tokens and class mappings.
prepare_labels – Whether to prepare labels.
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.

Returns:

Dictionary containing tokenized inputs and optionally labels.

class gliner.data_processing.processor.BaseBiEncoderProcessor(config, tokenizer, words_splitter, labels_tokenizer)[source]¶

Bases: BaseProcessor

Base processor for bi-encoder architectures.

Bi-encoder models use separate encoders for text and entity types.

__init__(config, tokenizer, words_splitter, labels_tokenizer)[source]¶

Initialize the bi-encoder processor.

Parameters:

config – Configuration object.
tokenizer – Transformer tokenizer for text encoding.
words_splitter – Word-level tokenizer/splitter.
labels_tokenizer – Separate tokenizer for entity type encoding.

tokenize_inputs(texts, entities=None)[source]¶

Tokenize inputs for bi-encoder architecture.

Separately tokenizes text sequences and entity types using different tokenizers.

Parameters:

texts – Sequences of token strings.
entities – Optional list of entity types to encode.

Returns:

Dictionary containing:

- input_ids: Text token IDs
- attention_mask: Text attention mask
- words_mask: Word-level mask
- labels_input_ids: Entity type token IDs (if entities provided)
- labels_attention_mask: Entity type attention mask (if entities provided)

batch_generate_class_mappings(batch_list, *args)[source]¶

Generate class mappings for bi-encoder with batch-level type pooling.

Unlike uni-encoder which generates per-example mappings, bi-encoder creates a single shared mapping across the batch for more efficient entity type encoding.

Parameters:

batch_list (List[Dict]) – List of example dictionaries.
*args – Variable length argument list (unused).

Returns:

Tuple containing:

- List of identical class-to-ID mappings (one per example)
- List of identical ID-to-class mappings (one per example)
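Batch-level type pooling can be sketched as follows. The function name and the 1-based IDs are assumptions; the point of the example is only that every example in the batch receives the same mapping.

```python
from typing import Dict, List, Tuple


def shared_class_mappings(
    batch_list: List[Dict],
    key: str = "ner",
) -> Tuple[List[Dict[str, int]], List[Dict[int, str]]]:
    """Pool all entity types in the batch into one mapping shared by all examples."""
    # Collect labels across the batch, deduplicated in first-seen order.
    types = list(
        dict.fromkeys(label for ex in batch_list for _, _, label in ex[key])
    )
    class_to_id = {label: idx for idx, label in enumerate(types, start=1)}
    id_to_class = {idx: label for label, idx in class_to_id.items()}
    n = len(batch_list)
    return [class_to_id] * n, [id_to_class] * n


pooled_batch = [
    {"ner": [(0, 0, "person")]},
    {"ner": [(1, 2, "city"), (4, 4, "person")]},
]
pooled_c2i, pooled_i2c = shared_class_mappings(pooled_batch)
```

With a shared mapping, each entity type only needs to be encoded once per batch by the label encoder instead of once per example.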

class gliner.data_processing.processor.BiEncoderSpanProcessor(config, tokenizer, words_splitter, labels_tokenizer)[source]¶

Bases: UniEncoderSpanProcessor, BaseBiEncoderProcessor

Processor for span-based NER with bi-encoder architecture.

Combines span enumeration from UniEncoderSpanProcessor with the bi-encoder approach from BaseBiEncoderProcessor.

Initialize the bi-encoder processor.

Parameters:

config – Configuration object.
tokenizer – Transformer tokenizer for text encoding.
words_splitter – Word-level tokenizer/splitter.
labels_tokenizer – Separate tokenizer for entity type encoding.

tokenize_and_prepare_labels(batch, prepare_labels, prepare_entities=True, *args, **kwargs)[source]¶

Tokenize inputs and prepare span labels for bi-encoder.

Parameters:

batch – Batch dictionary with tokens and class mappings.
prepare_labels – Whether to prepare labels.
prepare_entities – Whether to encode entity types separately.
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.

Returns:

Dictionary containing tokenized inputs, entity encodings, and optionally labels.

class gliner.data_processing.processor.BiEncoderTokenProcessor(config, tokenizer, words_splitter, labels_tokenizer)[source]¶

Bases: UniEncoderTokenProcessor, BaseBiEncoderProcessor

Processor for token-based NER with bi-encoder architecture.

Combines token-level classification from UniEncoderTokenProcessor with the dual-encoder approach from BaseBiEncoderProcessor.

Initialize the bi-encoder processor.

Parameters:

config – Configuration object.
tokenizer – Transformer tokenizer for text encoding.
words_splitter – Word-level tokenizer/splitter.
labels_tokenizer – Separate tokenizer for entity type encoding.

tokenize_and_prepare_labels(batch, prepare_labels, prepare_entities=True, **kwargs)[source]¶

Tokenize inputs and prepare token-level labels for bi-encoder.

Parameters:

batch – Batch dictionary with tokens and class mappings.
prepare_labels – Whether to prepare labels.
prepare_entities – Whether to encode entity types separately.
**kwargs – Arbitrary keyword arguments.

Returns:

Dictionary containing tokenized inputs, entity encodings, and optionally labels.

class gliner.data_processing.processor.UniEncoderSpanDecoderProcessor(config, tokenizer, words_splitter, decoder_tokenizer)[source]¶

Bases: UniEncoderSpanProcessor

Processor for span-based NER with encoder-decoder architecture.

Extends span-based processing with a decoder that generates entity type labels autoregressively, enabling more flexible prediction strategies.

__init__(config, tokenizer, words_splitter, decoder_tokenizer)[source]¶

Initialize the encoder-decoder processor.

Parameters:

config – Configuration object.
tokenizer – Transformer tokenizer for encoding.
words_splitter – Word-level tokenizer/splitter.
decoder_tokenizer – Separate tokenizer for decoder (label generation).

tokenize_inputs(texts, entities, blank=None)[source]¶

Tokenize inputs for encoder-decoder architecture.

Prepares both encoder and decoder inputs, with optional decoder context based on configuration.

Parameters:

texts – Sequences of token strings.
entities – Entity types for extraction.
blank – Optional blank entity token for zero-shot scenarios.

Returns:

Dictionary containing encoder and decoder tokenized inputs.

create_labels(batch, blank=None)[source]¶

Create labels for both span classification and decoder generation.

Parameters:

batch – Batch dictionary containing tokens, entities, and class mappings.
blank – Optional blank entity token for zero-shot scenarios.

Returns:

Tuple containing:

- Span classification labels (one-hot encoded)
- Decoder generation labels (tokenized entity types) or None

tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)[source]¶

Tokenize inputs and prepare labels for encoder-decoder training.

Parameters:

batch – Batch dictionary with tokens and class mappings.
prepare_labels – Whether to prepare labels.
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.

Returns:

Dictionary containing encoder inputs, decoder inputs, and labels.

class gliner.data_processing.processor.RelationExtractionSpanProcessor(config, tokenizer, words_splitter)[source]¶

Bases: UniEncoderSpanProcessor

Processor for joint entity and relation extraction.

Extends span-based NER processing to additionally handle relation extraction between entity pairs, supporting end-to-end joint training.

__init__(config, tokenizer, words_splitter)[source]¶

Initialize the relation extraction processor.

Parameters:

config – Configuration object.
tokenizer – Transformer tokenizer.
words_splitter – Word-level tokenizer/splitter.

batch_generate_class_mappings(batch_list, ner_negatives=None, rel_negatives=None, sampled_neg=100)[source]¶

Generate class mappings for both entities and relations.

Creates separate mappings for entity types and relation types with support for negative sampling for both.

Parameters:

batch_list (List[Dict]) – List of example dictionaries.
ner_negatives (List[str] | None) – Optional pre-sampled negative entity types.
rel_negatives (List[str] | None) – Optional pre-sampled negative relation types.
sampled_neg (int) – Number of negative types to sample if negatives not provided.

Returns:

Tuple containing:

- List of entity class-to-ID mappings
- List of entity ID-to-class mappings
- List of relation class-to-ID mappings
- List of relation ID-to-class mappings

collate_raw_batch(batch_list, entity_types=None, relation_types=None, ner_negatives=None, rel_negatives=None, class_to_ids=None, id_to_classes=None, rel_class_to_ids=None, rel_id_to_classes=None, key='ner')[source]¶

Collate a raw batch with entity and relation label mappings.

Parameters:

batch_list (List[Dict]) – List of raw example dictionaries.
entity_types (List[str | List[str]] | None) – Optional predefined entity types.
relation_types (List[str | List[str]] | None) – Optional predefined relation types.
ner_negatives (List[str] | None) – Optional negative entity types.
rel_negatives (List[str] | None) – Optional negative relation types.
class_to_ids (Dict[str, int] | List[Dict[str, int]] | None) – Optional entity class-to-ID mapping(s).
id_to_classes (Dict[int, str] | List[Dict[int, str]] | None) – Optional entity ID-to-class mapping(s).
rel_class_to_ids (Dict[str, int] | List[Dict[str, int]] | None) – Optional relation class-to-ID mapping(s).
rel_id_to_classes (Dict[int, str] | List[Dict[int, str]] | None) – Optional relation ID-to-class mapping(s).
key – Key for accessing labels in batch (default: 'ner').

Returns:

Dictionary containing collated batch data for joint entity and relation extraction.

Return type:

Dict

preprocess_example(tokens, ner, classes_to_id, relations, rel_classes_to_id)[source]¶

Preprocess a single example for joint entity and relation extraction.

Processes both entity spans and relation triplets, ensuring consistent indexing when entities are reordered.

Parameters:

tokens – List of token strings.
ner – List of entity annotations as (start, end, label) tuples.
classes_to_id – Mapping from entity class labels to integer IDs.
relations – List of relation annotations as (head_idx, tail_idx, rel_type) tuples.
rel_classes_to_id – Mapping from relation class labels to integer IDs.

Returns:

Dictionary containing:

- tokens: Token strings
- span_idx: Tensor of span indices
- span_label: Tensor of entity labels for each span
- seq_length: Sequence length
- entities: Original entity annotations
- relations: Original relation annotations
- rel_idx: Tensor of relation head/tail indices
- rel_label: Tensor of relation type labels

Warns:

UserWarning – If the sequence length exceeds max_len, in which case the sequence is truncated.
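Relations refer to entities by their position in the entity list, so reordering the entities requires remapping the head and tail indices. The sketch below shows one way this could work. The function name is hypothetical, and sorting entities by their (start, end) position is an assumption about the reordering.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]
Relation = Tuple[int, int, str]


def reindex_relations(
    ner: List[Entity],
    relations: List[Relation],
) -> Tuple[List[Entity], List[Relation]]:
    """Sort entities by position and remap relation head/tail indices to match."""
    # order[new_index] = old_index after sorting by (start, end).
    order = sorted(range(len(ner)), key=lambda i: (ner[i][0], ner[i][1]))
    old_to_new = {old: new for new, old in enumerate(order)}
    sorted_ner = [ner[i] for i in order]
    remapped = [
        (old_to_new[head], old_to_new[tail], rel_type)
        for head, tail, rel_type in relations
    ]
    return sorted_ner, remapped


entities = [(3, 3, "city"), (0, 0, "person")]
rels = [(1, 0, "lives_in")]  # head = entity 1 (person), tail = entity 0 (city)
sorted_entities, sorted_rels = reindex_relations(entities, rels)
```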

create_batch_dict(batch, class_to_ids, id_to_classes, rel_class_to_ids, rel_id_to_classes)[source]¶

Create a batch dictionary from preprocessed relation extraction examples.

Parameters:

batch – List of preprocessed example dictionaries.
class_to_ids – List of entity class-to-ID mappings.
id_to_classes – List of entity ID-to-class mappings.
rel_class_to_ids – List of relation class-to-ID mappings.
rel_id_to_classes – List of relation ID-to-class mappings.

Returns:

Dictionary containing all batch data for joint entity and relation extraction, including entity spans, relation pairs, and their labels.

create_relation_labels(batch, add_reversed_negatives=True, add_random_negatives=True, negative_ratio=2.0)[source]¶

Create relation labels with negative pair sampling.

Generates training labels for relation extraction including both positive relation pairs and carefully sampled negative pairs for contrastive learning.

Parameters:

batch – Batch dictionary containing entities and relations.
add_reversed_negatives – If True, add reversed direction pairs as negatives (h,t) -> (t,h). These are important hard negatives for learning relation directionality.
add_random_negatives – If True, add random entity pairs as negatives to provide additional training signal.
negative_ratio – Ratio of negative to positive pairs. For example, 2.0 means twice as many negatives as positives.

Returns:

Tuple containing:

- adj_matrix: Adjacency matrix indicating which entity pairs to consider (shape: [B, max_entities, max_entities])
- rel_matrix: Multi-hot encoded relation labels for each pair (shape: [B, max_pairs, num_relation_classes])
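The negative-pair sampling described by the parameters above can be sketched for a single example as follows. This is an illustration under assumptions, not the library's procedure: the function name and `seed` parameter are hypothetical, reversed pairs are assumed to be added before random ones, and the negative budget is assumed to be `int(len(positives) * negative_ratio)`.

```python
import random
from typing import List, Tuple

Pair = Tuple[int, int]


def sample_negative_pairs(
    num_entities: int,
    positives: List[Pair],
    add_reversed_negatives: bool = True,
    add_random_negatives: bool = True,
    negative_ratio: float = 2.0,
    seed: int = 0,  # hypothetical parameter, added only for reproducibility
) -> List[Pair]:
    """Pick negative (head, tail) pairs for one example."""
    rng = random.Random(seed)
    positive_set = set(positives)
    budget = int(len(positive_set) * negative_ratio)
    negatives: List[Pair] = []

    if add_reversed_negatives:
        # Hard negatives: the same two entities in the opposite direction.
        for head, tail in positives:
            reverse = (tail, head)
            if (
                reverse not in positive_set
                and reverse not in negatives
                and len(negatives) < budget
            ):
                negatives.append(reverse)

    if add_random_negatives:
        # Fill the remaining budget with random unrelated entity pairs.
        candidates = [
            (head, tail)
            for head in range(num_entities)
            for tail in range(num_entities)
            if head != tail
            and (head, tail) not in positive_set
            and (head, tail) not in negatives
        ]
        rng.shuffle(candidates)
        negatives.extend(candidates[: max(0, budget - len(negatives))])
    return negatives


neg_pairs = sample_negative_pairs(num_entities=3, positives=[(0, 1)])
```

Reversed pairs share both entities with a true relation, so they force the model to attend to direction rather than mere co-occurrence.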

prepare_inputs(texts, entities, blank=None, relations=None, **kwargs)[source]¶

Prepare input texts with entity and relation type prompts.

Extends the base prepare_inputs to include relation type tokens in the prompt.

Parameters:

texts (Sequence[Sequence[str]]) – Sequences of token strings, one per example.
entities (Sequence[Sequence[str]] | Dict[int, Sequence[str]] | Sequence[str]) – Entity types to extract.
blank (str | None) – Optional blank entity token for zero-shot scenarios.
relations (Sequence[Sequence[str]] | Dict[int, Sequence[str]] | Sequence[str] | None) – Relation types to extract (optional).
**kwargs – Additional keyword arguments.

Returns:

Tuple containing:

- List of input text sequences with prepended prompts
- List of prompt lengths for each example

tokenize_inputs(texts, entities, blank=None, relations=None, **kwargs)[source]¶

Tokenize input texts with entity and relation prompts.

Parameters:

texts – Sequences of token strings.
entities – Entity types for extraction.
blank – Optional blank entity token.
relations – Optional relation types for extraction.
**kwargs – Additional keyword arguments.

Returns:

Dictionary containing tokenized inputs with word masks.

tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)[source]¶

Tokenize inputs and prepare labels for joint entity-relation extraction.

Parameters:

batch – Batch dictionary with tokens, entities, relations, and class mappings.
prepare_labels – Whether to prepare labels.
*args – Variable length argument list.
**kwargs – Arbitrary keyword arguments.

Returns:

Dictionary containing tokenized inputs, entity labels, relation adjacency matrix, and relation labels.