gliner.data_processing.processor module
- class gliner.data_processing.processor.BaseProcessor(config, tokenizer, words_splitter)
Bases: ABC
Abstract base class for data processors.
This class provides the common interface and utilities for all processor implementations, handling tokenization, label preparation, and batch collation for named entity recognition (NER) and relation extraction (RE) tasks.
- __init__(config, tokenizer, words_splitter)
Initialize the base processor.
- Parameters:
config: Configuration object containing model and processing parameters.
tokenizer: Transformer tokenizer for subword tokenization.
words_splitter: Word-level tokenizer/splitter. If None, one is created based on config.words_splitter_type.
- static get_dict(spans, classes_to_id)
Create a dictionary mapping spans to their class IDs.
- Parameters:
spans (List[Tuple[int, int, str]]): List of tuples (start, end, label) representing entity spans.
classes_to_id (Dict[str, int]): Mapping from class labels to integer IDs.
- Returns:
Dictionary mapping (start, end) tuples to class IDs.
- Return type:
Dict[Tuple[int, int], int]
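The mapping described above can be illustrated with a minimal sketch. The helper name `spans_to_dict` is hypothetical, and the sketch assumes every label in `spans` has an entry in `classes_to_id`. How the library treats unknown labels is not stated here.

```python
from typing import Dict, List, Tuple


def spans_to_dict(
    spans: List[Tuple[int, int, str]],
    classes_to_id: Dict[str, int],
) -> Dict[Tuple[int, int], int]:
    """Map each (start, end) span to the integer ID of its label."""
    # Assumption: every label appears in classes_to_id (a KeyError is raised otherwise).
    return {(start, end): classes_to_id[label] for start, end, label in spans}


spans = [(0, 1, "person"), (4, 4, "city")]
classes_to_id = {"person": 1, "city": 2}
span_ids = spans_to_dict(spans, classes_to_id)
print(span_ids)
```

Keying the dictionary by the (start, end) pair makes it cheap to look up the class of any candidate span later on.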
- abstract preprocess_example(tokens, ner, classes_to_id)
Preprocess a single example for model input.
- Parameters:
tokens (List[str]): List of token strings.
ner (List[Tuple[int, int, str]]): List of NER annotations as (start, end, label) tuples.
classes_to_id (Dict[str, int]): Mapping from class labels to integer IDs.
- Returns:
Dictionary containing preprocessed example data.
- Raises:
NotImplementedError: Must be implemented by subclasses.
- Return type:
Dict
- abstract create_labels()
Create label tensors from batch data.
- Returns:
Tensor containing labels for the batch.
- Raises:
NotImplementedError: Must be implemented by subclasses.
- Return type:
Tensor
- abstract tokenize_and_prepare_labels()
Tokenize inputs and prepare labels for a batch.
- Raises:
NotImplementedError: Must be implemented by subclasses.
- prepare_inputs(texts, entities, blank=None, add_entities=True, **kwargs)
Prepare input texts with entity type prompts.
Prepends entity type special tokens that aggregate entity label information.
- Parameters:
texts (Sequence[Sequence[str]]): Sequences of token strings, one per example.
entities (Sequence[Sequence[str]] | Dict[int, Sequence[str]] | Sequence[str]): Entity types to extract. Can be a list of lists (per-example entity types), a dictionary (shared entity types), or a list of strings (same types for all examples).
blank (str | None): Optional blank entity token for zero-shot scenarios.
add_entities (bool | None): Whether to add the entity type strings to the prompt.
**kwargs: Additional keyword arguments.
- Returns:
Tuple containing:
List of input text sequences with prepended prompts
List of prompt lengths for each example
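As a rough illustration of the prompt construction described above, the sketch below handles only the simplest case, where one list of entity types is shared by all examples. The marker strings `<<ENT>>` and `<<SEP>>` and the helper name `build_prompted_inputs` are assumptions made for this example. The real tokens come from the processor's configuration.

```python
from typing import List, Sequence, Tuple

ENT_TOKEN = "<<ENT>>"  # assumed marker placed before each entity type
SEP_TOKEN = "<<SEP>>"  # assumed marker separating the prompt from the text


def build_prompted_inputs(
    texts: Sequence[Sequence[str]],
    entity_types: Sequence[str],
    add_entities: bool = True,
) -> Tuple[List[List[str]], List[int]]:
    """Prepend an entity-type prompt to every example and record its length."""
    inputs, prompt_lengths = [], []
    for tokens in texts:
        prompt: List[str] = []
        for entity_type in entity_types:
            prompt.append(ENT_TOKEN)
            if add_entities:
                prompt.append(entity_type)  # the label text follows its marker
        prompt.append(SEP_TOKEN)
        inputs.append(prompt + list(tokens))
        prompt_lengths.append(len(prompt))
    return inputs, prompt_lengths


inputs, lengths = build_prompted_inputs(
    [["Ada", "lives", "in", "Paris"]], ["person", "city"]
)
print(inputs[0], lengths)
```

Returning the prompt length alongside each sequence lets later steps skip the prompt words when mapping subword tokens back to the original text.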
- prepare_word_mask(texts, tokenized_inputs, skip_first_words=None, token_level=False)
Prepare word-level masks for tokenized inputs.
Creates masks that map subword tokens back to their original words.
- Parameters:
texts: Original text sequences.
tokenized_inputs: Tokenized inputs from the transformer tokenizer.
skip_first_words: Optional list of word counts to skip per example (e.g., prompt words).
token_level: If True, create token-level masks instead of word-level masks.
- Returns:
Word mask array.
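One way to picture a word-level mask is sketched below for a single example. It assumes the tokenizer exposes a per-subword list of word indices (as Hugging Face fast tokenizers do through `word_ids()`), marks the first subword of each text word with a 1-based word index, and leaves everything else at 0. That encoding and the helper name `word_mask_from_word_ids` are assumptions for illustration and may differ from the array the library actually returns.

```python
from typing import List, Optional


def word_mask_from_word_ids(
    word_ids: List[Optional[int]], skip_first_words: int = 0
) -> List[int]:
    """Mark the first subword of each non-prompt word with its 1-based index."""
    mask: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            mask.append(0)  # special tokens such as [CLS] or [SEP]
        elif word_id != previous and word_id >= skip_first_words:
            mask.append(word_id - skip_first_words + 1)  # first subword of a text word
        else:
            mask.append(0)  # prompt word or continuation subword
        previous = word_id
    return mask


# Subwords: [CLS] <<ENT>> person <<SEP>> Ada lives in Par ##is [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 6, None]
mask = word_mask_from_word_ids(word_ids, skip_first_words=3)
print(mask)
```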
- tokenize_inputs(texts, entities, blank=None, **kwargs)
Tokenize input texts with entity prompts.
- Parameters:
texts: Sequences of token strings.
entities: Entity types for extraction.
blank: Optional blank entity token.
**kwargs: Additional keyword arguments.
- Returns:
Dictionary containing tokenized inputs with keys:
input_ids: Token IDs
attention_mask: Attention mask
words_mask: Word-level mask
- batch_generate_class_mappings(batch_list, negatives=None, key='ner', sampled_neg=100)
Generate class mappings for a batch with negative sampling.
Creates bidirectional mappings between class labels and integer IDs, with support for negative type sampling to improve model robustness.
- Parameters:
batch_list (List[Dict]): List of example dictionaries.
negatives (List[str] | None): Optional pre-sampled negative types. If None, negatives are sampled from the batch.
key (str): Key to access labels in batch dictionaries (default: 'ner').
sampled_neg (int): Number of negative types to sample if negatives is None.
- Returns:
Tuple containing:
List of class-to-ID mappings (one per example)
List of ID-to-class mappings (one per example)
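The idea of per-example mappings with sampled negatives can be sketched as follows. This is a simplified illustration, not the library's implementation. The helper name `generate_class_mappings`, the `seed` parameter, the ordering of types, and the choice to start IDs at 1 are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def generate_class_mappings(
    batch_list: List[Dict],
    key: str = "ner",
    sampled_neg: int = 2,
    seed: int = 0,
) -> Tuple[List[Dict[str, int]], List[Dict[int, str]]]:
    """Build one label<->ID mapping per example, padded with negative types."""
    rng = random.Random(seed)
    # Pool every type seen in the batch; types absent from an example act as negatives.
    all_types = sorted({label for ex in batch_list for _, _, label in ex[key]})
    class_to_ids, id_to_classes = [], []
    for ex in batch_list:
        positives = sorted({label for _, _, label in ex[key]})
        candidates = [t for t in all_types if t not in positives]
        negatives = rng.sample(candidates, min(sampled_neg, len(candidates)))
        # Assumption: IDs start at 1 so that 0 stays free for "no class".
        class_to_id = {t: i for i, t in enumerate(positives + negatives, start=1)}
        class_to_ids.append(class_to_id)
        id_to_classes.append({i: t for t, i in class_to_id.items()})
    return class_to_ids, id_to_classes


batch = [
    {"ner": [(0, 0, "person")]},
    {"ner": [(1, 1, "city"), (3, 3, "org")]},
]
class_to_ids, id_to_classes = generate_class_mappings(batch)
print(class_to_ids[1])
```

Types that do not occur in an example serve as distractors: the model sees them in the prompt but no span carries them as a label.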
- collate_raw_batch(batch_list, entity_types=None, negatives=None, class_to_ids=None, id_to_classes=None, key='ner')
Collate a raw batch with optional dynamic or provided label mappings.
- Parameters:
batch_list (List[Dict]): List of raw example dictionaries.
entity_types (List[str | List[str]] | None): Optional predefined entity types. Can be a single list for all examples or a list of lists for per-example types.
negatives (List[str] | None): Optional list of negative entity types.
class_to_ids (Dict[str, int] | List[Dict[str, int]] | None): Optional predefined class-to-ID mapping(s).
id_to_classes (Dict[int, str] | List[Dict[int, str]] | None): Optional predefined ID-to-class mapping(s).
key: Key for accessing labels in batch (default: 'ner').
- Returns:
Dictionary containing collated batch data ready for model input.
- Return type:
Dict
- collate_fn(batch, prepare_labels=True, *args, **kwargs)
Collate function for DataLoader.
- Parameters:
batch: Batch of examples from the dataset.
prepare_labels: Whether to prepare labels (default: True).
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing model inputs and labels.
- abstract create_batch_dict(batch, class_to_ids, id_to_classes)
Create a batch dictionary from preprocessed examples.
- Parameters:
batch (List[Dict]): List of preprocessed example dictionaries.
class_to_ids (List[Dict[str, int]]): List of class-to-ID mappings.
id_to_classes (List[Dict[int, str]]): List of ID-to-class mappings.
- Returns:
Dictionary containing collated batch tensors.
- Raises:
NotImplementedError: Must be implemented by subclasses.
- Return type:
Dict
- create_dataloader(data, entity_types=None, *args, **kwargs)
Create a PyTorch DataLoader with the processor's collate function.
- Parameters:
data: Dataset to load.
entity_types: Optional entity types for extraction.
*args: Additional positional arguments for DataLoader.
**kwargs: Additional keyword arguments for DataLoader.
- Returns:
DataLoader instance configured with this processor's collate_fn.
- Return type:
DataLoader
- class gliner.data_processing.processor.UniEncoderSpanProcessor(config, tokenizer, words_splitter)
Bases: BaseProcessor
Processor for span-based NER with uni-encoder architecture.
This processor handles span enumeration and labeling for models that predict entity types for all possible spans up to a maximum width.
Initialize the base processor.
- Parameters:
config: Configuration object containing model and processing parameters.
tokenizer: Transformer tokenizer for subword tokenization.
words_splitter: Word-level tokenizer/splitter. If None, one is created based on config.words_splitter_type.
- preprocess_example(tokens, ner, classes_to_id)
Preprocess a single example for span-based prediction.
Enumerates all possible spans up to max_width and creates labels for each span based on NER annotations.
- Parameters:
tokens: List of token strings.
ner: List of NER annotations as (start, end, label) tuples.
classes_to_id: Mapping from class labels to integer IDs.
- Returns:
Dictionary containing:
tokens: Token strings
span_idx: Tensor of span indices (start, end)
span_label: Tensor of span labels
seq_length: Sequence length
entities: Original NER annotations
- Warns:
UserWarning: If the sequence length exceeds max_len; the sequence is truncated.
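Span enumeration and labeling can be sketched in a few lines of plain Python. The sketch assumes inclusive end indices, label 0 for spans without an annotation, and label -1 for spans that run past the end of the sequence. The helper names are hypothetical, and the library itself returns tensors rather than lists.

```python
from typing import Dict, List, Tuple


def enumerate_spans(num_tokens: int, max_width: int) -> List[Tuple[int, int]]:
    """List (start, end) pairs for every start position and every width."""
    return [(i, i + w) for i in range(num_tokens) for w in range(max_width)]


def label_spans(
    tokens: List[str],
    ner: List[Tuple[int, int, str]],
    classes_to_id: Dict[str, int],
    max_width: int,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Assign a class ID to each enumerated span."""
    span_to_id = {(s, e): classes_to_id[label] for s, e, label in ner}
    spans = enumerate_spans(len(tokens), max_width)
    labels = []
    for start, end in spans:
        if end >= len(tokens):
            labels.append(-1)  # assumed marker for spans that leave the sequence
        else:
            labels.append(span_to_id.get((start, end), 0))  # 0 means no entity
    return spans, labels


tokens = ["Ada", "lives", "in", "Paris"]
ner = [(0, 0, "person"), (3, 3, "city")]
spans, labels = label_spans(tokens, ner, {"person": 1, "city": 2}, max_width=2)
print(list(zip(spans, labels)))
```

Enumerating a fixed number of widths per start position keeps the span grid rectangular, which is convenient for padding and batching.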
- create_batch_dict(batch, class_to_ids, id_to_classes)
Create a batch dictionary from preprocessed span examples.
- Parameters:
batch: List of preprocessed example dictionaries.
class_to_ids: List of class-to-ID mappings.
id_to_classes: List of ID-to-class mappings.
- Returns:
Dictionary containing:
seq_length: Sequence lengths
span_idx: Padded span indices
tokens: Token strings
span_mask: Mask for valid spans
span_label: Padded span labels
entities: Original NER annotations
classes_to_id: Class mappings
id_to_classes: Reverse class mappings
- create_labels(batch)
Create one-hot encoded labels for spans.
Creates multi-label one-hot vectors for each span, allowing spans to have multiple entity types.
- Parameters:
batch: Batch dictionary containing tokens, entities, and class mappings.
- Returns:
Tensor of shape (batch_size, max_spans, num_classes) containing one-hot encoded labels.
- tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)
Tokenize inputs and prepare span labels for a batch.
- Parameters:
batch: Batch dictionary with tokens and class mappings.
prepare_labels: Whether to prepare labels.
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing tokenized inputs and optionally labels.
- class gliner.data_processing.processor.UniEncoderTokenProcessor(config, tokenizer, words_splitter)
Bases: BaseProcessor
Processor for token-based NER with uni-encoder architecture.
This processor handles token-level classification where each token is labeled with BIO-style tags (Begin, Inside, Outside) for each entity type.
Initialize the base processor.
- Parameters:
config: Configuration object containing model and processing parameters.
tokenizer: Transformer tokenizer for subword tokenization.
words_splitter: Word-level tokenizer/splitter. If None, one is created based on config.words_splitter_type.
- preprocess_example(tokens, ner, classes_to_id)
Preprocess a single example for token-based prediction.
- Parameters:
tokens: List of token strings.
ner: List of NER annotations as (start, end, label) tuples.
classes_to_id: Mapping from class labels to integer IDs.
- Returns:
Dictionary containing:
tokens: Token strings
seq_length: Sequence length
entities: Original NER annotations
span_idx: Tensor of entity span indices (if represent_spans=True)
span_label: Tensor of entity class IDs (if represent_spans=True)
- Warns:
UserWarning: If the sequence length exceeds max_len; the sequence is truncated.
- create_batch_dict(batch, class_to_ids, id_to_classes)
Create a batch dictionary from preprocessed token examples.
- Parameters:
batch: List of preprocessed example dictionaries.
class_to_ids: List of class-to-ID mappings.
id_to_classes: List of ID-to-class mappings.
- Returns:
Dictionary containing:
tokens: Token strings
seq_length: Sequence lengths
entities: Original NER annotations
span_idx: Padded span indices (if available)
span_label: Padded span labels (if available)
span_mask: Mask for valid spans (if available)
classes_to_id: Class mappings
id_to_classes: Reverse class mappings
- create_labels(batch)
Create token-level labels with start/end/inside markers.
Creates labels indicating which tokens are at the start, at the end, or inside of entity spans for each entity type.
- Parameters:
batch (List[Any]): Batch of data.
- Returns:
Tensor of shape (batch_size, seq_len, num_classes, 3) where the last dimension contains [start_marker, end_marker, inside_marker].
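The label layout can be pictured for a single example with nested lists standing in for a tensor. The sketch assumes inclusive end indices, 1-indexed class IDs, and that the inside marker is set on every token of the span, including its first and last token. These conventions and the helper name `token_labels` are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def token_labels(
    seq_len: int,
    entities: List[Tuple[int, int, str]],
    classes_to_id: Dict[str, int],
) -> List[List[List[int]]]:
    """Build a (seq_len, num_classes, 3) grid of [start, end, inside] markers."""
    num_classes = len(classes_to_id)
    labels = [[[0, 0, 0] for _ in range(num_classes)] for _ in range(seq_len)]
    for start, end, label in entities:
        c = classes_to_id[label] - 1  # assumed 1-indexed IDs mapped to 0-indexed columns
        labels[start][c][0] = 1  # start marker
        labels[end][c][1] = 1  # end marker
        for t in range(start, end + 1):
            labels[t][c][2] = 1  # inside marker on every token of the span
    return labels


grid = token_labels(3, [(0, 1, "person")], {"person": 1, "city": 2})
print(grid[0][0], grid[1][0], grid[2][0])
```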
- create_span_labels(batch)
Create one-hot encoded labels for spans with negative sampling.
Creates one-hot encoded labels for entity spans, converting 1-indexed class IDs to 0-indexed format. Labels with class ID 0 (negative spans) or -1 (invalid spans) are represented as all zeros in the one-hot encoding.
- Parameters:
batch: Batch dictionary containing span_label, span_mask, and classes_to_id.
- Returns:
Tensor of shape (batch_size, max_spans, num_classes) containing one-hot encoded labels where:
Positive spans: one-hot vector at position (class_id - 1)
Negative/invalid spans: all zeros
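The conversion rule above (class_id - 1 for positives, all zeros for 0 and -1) can be shown with a short list-based sketch for one example. The helper name `one_hot_span_labels` is hypothetical, and the library operates on batched tensors instead of lists.

```python
from typing import List


def one_hot_span_labels(span_label: List[int], num_classes: int) -> List[List[int]]:
    """Turn 1-indexed class IDs into one-hot rows; 0 and -1 stay all zeros."""
    rows = []
    for class_id in span_label:
        row = [0] * num_classes
        if class_id > 0:
            row[class_id - 1] = 1  # shift from 1-indexed ID to 0-indexed column
        rows.append(row)
    return rows


one_hot = one_hot_span_labels([2, 0, -1, 1], num_classes=3)
print(one_hot)
```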
- tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)
Tokenize inputs and prepare token-level labels for a batch.
- Parameters:
batch: Batch dictionary with tokens and class mappings.
prepare_labels: Whether to prepare labels.
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing tokenized inputs and optionally labels.
- class gliner.data_processing.processor.BaseBiEncoderProcessor(config, tokenizer, words_splitter, labels_tokenizer)
Bases: BaseProcessor
Base processor for bi-encoder architectures.
Bi-encoder models use separate encoders for text and entity types.
- __init__(config, tokenizer, words_splitter, labels_tokenizer)
Initialize the bi-encoder processor.
- Parameters:
config: Configuration object.
tokenizer: Transformer tokenizer for text encoding.
words_splitter: Word-level tokenizer/splitter.
labels_tokenizer: Separate tokenizer for entity type encoding.
- tokenize_inputs(texts, entities=None)
Tokenize inputs for bi-encoder architecture.
Separately tokenizes text sequences and entity types using different tokenizers.
- Parameters:
texts: Sequences of token strings.
entities: Optional list of entity types to encode.
- Returns:
Dictionary containing:
input_ids: Text token IDs
attention_mask: Text attention mask
words_mask: Word-level mask
labels_input_ids: Entity type token IDs (if entities provided)
labels_attention_mask: Entity type attention mask (if entities provided)
- batch_generate_class_mappings(batch_list, *args)
Generate class mappings for bi-encoder with batch-level type pooling.
Unlike the uni-encoder processors, which generate per-example mappings, the bi-encoder processor creates a single mapping shared across the batch for more efficient entity type encoding.
- Parameters:
batch_list (List[Dict]): List of example dictionaries.
*args: Variable length argument list (unused).
- Returns:
Tuple containing:
List of identical class-to-ID mappings (one per example)
List of identical ID-to-class mappings (one per example)
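Batch-level type pooling can be sketched as follows. The sketch assumes labels are stored under a 'ner' key as (start, end, label) tuples, orders types by first appearance, and starts IDs at 1. The ordering, the ID offset, and the helper name `shared_class_mappings` are assumptions, not documented behavior.

```python
from typing import Dict, List, Tuple


def shared_class_mappings(
    batch_list: List[Dict], key: str = "ner"
) -> Tuple[List[Dict[str, int]], List[Dict[int, str]]]:
    """Pool all entity types in the batch into one mapping reused by every example."""
    pooled: List[str] = []
    for example in batch_list:
        for _, _, label in example[key]:
            if label not in pooled:
                pooled.append(label)  # keep order of first appearance
    class_to_id = {label: i for i, label in enumerate(pooled, start=1)}
    id_to_class = {i: label for label, i in class_to_id.items()}
    n = len(batch_list)
    # Every example points at the same mapping object.
    return [class_to_id] * n, [id_to_class] * n


batch = [
    {"ner": [(0, 0, "person")]},
    {"ner": [(1, 1, "city"), (3, 3, "org"), (5, 5, "person")]},
]
class_to_ids, id_to_classes = shared_class_mappings(batch)
print(class_to_ids[0])
```

With one shared mapping, each entity type only has to be encoded once per batch by the label encoder.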
- class gliner.data_processing.processor.BiEncoderSpanProcessor(config, tokenizer, words_splitter, labels_tokenizer)
Bases: UniEncoderSpanProcessor, BaseBiEncoderProcessor
Processor for span-based NER with bi-encoder architecture.
Combines span enumeration from UniEncoderSpanProcessor with the bi-encoder approach from BaseBiEncoderProcessor.
Initialize the bi-encoder processor.
- Parameters:
config: Configuration object.
tokenizer: Transformer tokenizer for text encoding.
words_splitter: Word-level tokenizer/splitter.
labels_tokenizer: Separate tokenizer for entity type encoding.
- tokenize_and_prepare_labels(batch, prepare_labels, prepare_entities=True, *args, **kwargs)
Tokenize inputs and prepare span labels for bi-encoder.
- Parameters:
batch: Batch dictionary with tokens and class mappings.
prepare_labels: Whether to prepare labels.
prepare_entities: Whether to encode entity types separately.
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing tokenized inputs, entity encodings, and optionally labels.
- class gliner.data_processing.processor.BiEncoderTokenProcessor(config, tokenizer, words_splitter, labels_tokenizer)
Bases: UniEncoderTokenProcessor, BaseBiEncoderProcessor
Processor for token-based NER with bi-encoder architecture.
Combines token-level classification from UniEncoderTokenProcessor with the bi-encoder approach from BaseBiEncoderProcessor.
Initialize the bi-encoder processor.
- Parameters:
config: Configuration object.
tokenizer: Transformer tokenizer for text encoding.
words_splitter: Word-level tokenizer/splitter.
labels_tokenizer: Separate tokenizer for entity type encoding.
- tokenize_and_prepare_labels(batch, prepare_labels, prepare_entities=True, **kwargs)
Tokenize inputs and prepare token-level labels for bi-encoder.
- Parameters:
batch: Batch dictionary with tokens and class mappings.
prepare_labels: Whether to prepare labels.
prepare_entities: Whether to encode entity types separately.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing tokenized inputs, entity encodings, and optionally labels.
- class gliner.data_processing.processor.UniEncoderSpanDecoderProcessor(config, tokenizer, words_splitter, decoder_tokenizer)
Bases: UniEncoderSpanProcessor
Processor for span-based NER with encoder-decoder architecture.
Extends span-based processing with a decoder that generates entity type labels autoregressively, enabling more flexible prediction strategies.
- __init__(config, tokenizer, words_splitter, decoder_tokenizer)
Initialize the encoder-decoder processor.
- Parameters:
config: Configuration object.
tokenizer: Transformer tokenizer for encoding.
words_splitter: Word-level tokenizer/splitter.
decoder_tokenizer: Separate tokenizer for decoder (label generation).
- tokenize_inputs(texts, entities, blank=None)
Tokenize inputs for encoder-decoder architecture.
Prepares both encoder and decoder inputs, with optional decoder context based on configuration.
- Parameters:
texts: Sequences of token strings.
entities: Entity types for extraction.
blank: Optional blank entity token for zero-shot scenarios.
- Returns:
Dictionary containing encoder and decoder tokenized inputs.
- create_labels(batch, blank=None)
Create labels for both span classification and decoder generation.
- Parameters:
batch: Batch dictionary containing tokens, entities, and class mappings.
blank: Optional blank entity token for zero-shot scenarios.
- Returns:
Tuple containing:
Span classification labels (one-hot encoded)
Decoder generation labels (tokenized entity types) or None
- tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)
Tokenize inputs and prepare labels for encoder-decoder training.
- Parameters:
batch: Batch dictionary with tokens and class mappings.
prepare_labels: Whether to prepare labels.
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing encoder inputs, decoder inputs, and labels.
- class gliner.data_processing.processor.UniEncoderTokenDecoderProcessor(config, tokenizer, words_splitter, decoder_tokenizer)
Bases: UniEncoderSpanDecoderProcessor, UniEncoderTokenProcessor
Processor for token-based NER with encoder-decoder architecture.
This processor combines token-level BIO-style classification with a decoder that generates entity type labels autoregressively, enabling more flexible prediction strategies for token-level NER tasks.
- Inherits from:
UniEncoderSpanDecoderProcessor: Encoder-decoder architecture and decoder utilities
UniEncoderTokenProcessor: Token-level BIO tagging for entities
- __init__(config, tokenizer, words_splitter, decoder_tokenizer)
Initialize the token-level encoder-decoder processor.
- Parameters:
config: Configuration object.
tokenizer: Transformer tokenizer for encoding.
words_splitter: Word-level tokenizer/splitter.
decoder_tokenizer: Separate tokenizer for decoder (label generation).
- preprocess_example(tokens, ner, classes_to_id)
Preprocess a single example for token-level encoder-decoder prediction.
Uses token-level preprocessing from UniEncoderTokenProcessor while preparing for decoder-based label generation.
- Parameters:
tokens: List of token strings.
ner: List of NER annotations as (start, end, label) tuples.
classes_to_id: Mapping from class labels to integer IDs.
- Returns:
Dictionary containing:
tokens: Token strings
seq_length: Sequence length
entities: Original NER annotations
span_idx: Tensor of entity span indices (if represent_spans=True)
span_label: Tensor of entity class IDs (if represent_spans=True)
- Warns:
UserWarning: If the sequence length exceeds max_len; the sequence is truncated.
- create_batch_dict(batch, class_to_ids, id_to_classes)
Create a batch dictionary from preprocessed token examples.
- Parameters:
batch: List of preprocessed example dictionaries.
class_to_ids: List of class-to-ID mappings.
id_to_classes: List of ID-to-class mappings.
- Returns:
Dictionary containing all batch data for token-level encoder-decoder processing.
- create_labels(batch, blank=None)
Create labels for both token classification and decoder generation.
Creates both token-level BIO labels and decoder generation labels for entity types.
- Parameters:
batch: Batch dictionary containing tokens, entities, and class mappings.
blank: Optional blank entity token for zero-shot scenarios.
- Returns:
Tuple containing:
Token-level labels (BIO-style, shape: [batch_size, seq_len, num_classes, 3])
Decoder generation labels (tokenized entity types) or None
- tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)
Tokenize inputs and prepare labels for token-level encoder-decoder training.
Combines token-level input processing with decoder inputs and prepares both token-level BIO labels and decoder generation labels.
- Parameters:
batch: Batch dictionary with tokens and class mappings.
prepare_labels: Whether to prepare labels.
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing encoder inputs, decoder inputs, token-level labels, and decoder labels.
- class gliner.data_processing.processor.RelationExtractionSpanProcessor(config, tokenizer, words_splitter)
Bases: UniEncoderSpanProcessor
Processor for joint entity and relation extraction.
Extends span-based NER processing to additionally handle relation extraction between entity pairs, supporting end-to-end joint training.
- __init__(config, tokenizer, words_splitter)
Initialize the relation extraction processor.
- Parameters:
config: Configuration object.
tokenizer: Transformer tokenizer.
words_splitter: Word-level tokenizer/splitter.
- batch_generate_class_mappings(batch_list, ner_negatives=None, rel_negatives=None, sampled_neg=100)
Generate class mappings for both entities and relations.
Creates separate mappings for entity types and relation types with support for negative sampling for both.
- Parameters:
batch_list (List[Dict]): List of example dictionaries.
ner_negatives (List[str] | None): Optional pre-sampled negative entity types.
rel_negatives (List[str] | None): Optional pre-sampled negative relation types.
sampled_neg (int): Number of negative types to sample if negatives are not provided.
- Returns:
Tuple containing:
List of entity class-to-ID mappings
List of entity ID-to-class mappings
List of relation class-to-ID mappings
List of relation ID-to-class mappings
- collate_raw_batch(batch_list, entity_types=None, relation_types=None, ner_negatives=None, rel_negatives=None, class_to_ids=None, id_to_classes=None, rel_class_to_ids=None, rel_id_to_classes=None, key='ner')
Collate a raw batch with entity and relation label mappings.
- Parameters:
batch_list (List[Dict]): List of raw example dictionaries.
entity_types (List[str | List[str]] | None): Optional predefined entity types.
relation_types (List[str | List[str]] | None): Optional predefined relation types.
ner_negatives (List[str] | None): Optional negative entity types.
rel_negatives (List[str] | None): Optional negative relation types.
class_to_ids (Dict[str, int] | List[Dict[str, int]] | None): Optional entity class-to-ID mapping(s).
id_to_classes (Dict[int, str] | List[Dict[int, str]] | None): Optional entity ID-to-class mapping(s).
rel_class_to_ids (Dict[str, int] | List[Dict[str, int]] | None): Optional relation class-to-ID mapping(s).
rel_id_to_classes (Dict[int, str] | List[Dict[int, str]] | None): Optional relation ID-to-class mapping(s).
key: Key for accessing labels in batch (default: 'ner').
- Returns:
Dictionary containing collated batch data for joint entity and relation extraction.
- Return type:
Dict
- preprocess_example(tokens, ner, classes_to_id, relations, rel_classes_to_id)
Preprocess a single example for joint entity and relation extraction.
Processes both entity spans and relation triplets, ensuring consistent indexing when entities are reordered.
- Parameters:
tokens: List of token strings.
ner: List of entity annotations as (start, end, label) tuples.
classes_to_id: Mapping from entity class labels to integer IDs.
relations: List of relation annotations as (head_idx, tail_idx, rel_type) tuples.
rel_classes_to_id: Mapping from relation class labels to integer IDs.
- Returns:
Dictionary containing:
tokens: Token strings
span_idx: Tensor of span indices
span_label: Tensor of entity labels for each span
seq_length: Sequence length
entities: Original entity annotations
relations: Original relation annotations
rel_idx: Tensor of relation head/tail indices
rel_label: Tensor of relation type labels
- Warns:
UserWarning: If the sequence length exceeds max_len; the sequence is truncated.
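Relations refer to entities by their position in the entity list, so any reordering of entities has to be mirrored in the head and tail indices. The sketch below shows one way to keep them aligned. Sorting by (start, end) and the helper name `sort_entities_and_remap` are assumptions for illustration. The text above only states that indexing stays consistent.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]
Relation = Tuple[int, int, str]


def sort_entities_and_remap(
    ner: List[Entity], relations: List[Relation]
) -> Tuple[List[Entity], List[Relation]]:
    """Sort entities by position and rewrite relation indices to match."""
    # order[new_position] = old_position
    order = sorted(range(len(ner)), key=lambda i: (ner[i][0], ner[i][1]))
    old_to_new = {old: new for new, old in enumerate(order)}
    sorted_ner = [ner[i] for i in order]
    remapped = [(old_to_new[h], old_to_new[t], rel) for h, t, rel in relations]
    return sorted_ner, remapped


ner = [(3, 3, "city"), (0, 0, "person")]
relations = [(1, 0, "lives_in")]  # head = entity 1 (person), tail = entity 0 (city)
sorted_ner, remapped = sort_entities_and_remap(ner, relations)
print(sorted_ner, remapped)
```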
- create_batch_dict(batch, class_to_ids, id_to_classes, rel_class_to_ids, rel_id_to_classes)
Create a batch dictionary from preprocessed relation extraction examples.
- Parameters:
batch: List of preprocessed example dictionaries.
class_to_ids: List of entity class-to-ID mappings.
id_to_classes: List of entity ID-to-class mappings.
rel_class_to_ids: List of relation class-to-ID mappings.
rel_id_to_classes: List of relation ID-to-class mappings.
- Returns:
Dictionary containing all batch data for joint entity and relation extraction, including entity spans, relation pairs, and their labels.
- create_relation_labels(batch, add_reversed_negatives=True, add_random_negatives=True, negative_ratio=2.0)
Create relation labels with negative pair sampling.
Overrides the span-based version to work with token-level entity representations. Uses entities_id count instead of span_label for entity counting.
- Parameters:
batch: Batch dictionary containing entities and relations.
add_reversed_negatives: If True, add reversed direction pairs as negatives.
add_random_negatives: If True, add random entity pairs as negatives.
negative_ratio: Ratio of negative to positive pairs.
- Returns:
Tuple containing:
adj_matrix: Adjacency matrix (shape: [B, max_entities, max_entities])
rel_matrix: Multi-hot relation labels (shape: [B, max_pairs, num_relation_classes])
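Negative pair sampling for one example can be sketched as below. This is a simplified reading of the parameters above, not the library's implementation. It treats negative_ratio as a cap on the total number of negatives relative to positives, builds a plain list-of-lists adjacency matrix from the positive pairs only, and uses a hypothetical helper name and `seed` parameter. Whether the library also enters sampled negatives into adj_matrix is not stated here.

```python
import random
from typing import List, Tuple

Pair = Tuple[int, int]


def relation_pairs(
    num_entities: int,
    positives: List[Pair],
    add_reversed_negatives: bool = True,
    add_random_negatives: bool = True,
    negative_ratio: float = 2.0,
    seed: int = 0,
) -> Tuple[List[List[int]], List[Pair]]:
    """Return an adjacency matrix of positive pairs and a list of negative pairs."""
    rng = random.Random(seed)
    positive_set = set(positives)
    adj = [[0] * num_entities for _ in range(num_entities)]
    for head, tail in positive_set:
        adj[head][tail] = 1

    negatives: List[Pair] = []
    if add_reversed_negatives:
        for head, tail in positives:
            reverse = (tail, head)
            if reverse not in positive_set and reverse not in negatives:
                negatives.append(reverse)  # same entities, wrong direction
    if add_random_negatives:
        target = int(negative_ratio * len(positive_set))
        pool = [
            (h, t)
            for h in range(num_entities)
            for t in range(num_entities)
            if h != t and (h, t) not in positive_set and (h, t) not in negatives
        ]
        rng.shuffle(pool)
        negatives.extend(pool[: max(0, target - len(negatives))])
    return adj, negatives


adj, negatives = relation_pairs(num_entities=3, positives=[(0, 1)])
print(adj, negatives)
```

Reversed pairs are hard negatives for directed relations, since they share both entities with a true relation and differ only in direction.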
- prepare_inputs(texts, entities, blank=None, relations=None, **kwargs)
Prepare input texts with entity and relation type prompts.
Extends the base prepare_inputs to include relation type tokens in the prompt.
- Parameters:
texts (Sequence[Sequence[str]]): Sequences of token strings, one per example.
entities (Sequence[Sequence[str]] | Dict[int, Sequence[str]] | Sequence[str]): Entity types to extract.
blank (str | None): Optional blank entity token for zero-shot scenarios.
relations (Sequence[Sequence[str]] | Dict[int, Sequence[str]] | Sequence[str] | None): Relation types to extract (optional).
**kwargs: Additional keyword arguments.
- Returns:
Tuple containing:
List of input text sequences with prepended prompts
List of prompt lengths for each example
- tokenize_inputs(texts, entities, blank=None, relations=None, **kwargs)
Tokenize input texts with entity and relation prompts.
- Parameters:
texts: Sequences of token strings.
entities: Entity types for extraction.
blank: Optional blank entity token.
relations: Optional relation types for extraction.
**kwargs: Additional keyword arguments.
- Returns:
Dictionary containing tokenized inputs with word masks.
- tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)
Tokenize inputs and prepare labels for joint entity-relation extraction.
- Parameters:
batch: Batch dictionary with tokens, entities, relations, and class mappings.
prepare_labels: Whether to prepare labels.
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing tokenized inputs, entity labels, relation adjacency matrix, and relation labels.
- class gliner.data_processing.processor.RelationExtractionTokenProcessor(config, tokenizer, words_splitter)
Bases: UniEncoderTokenProcessor, RelationExtractionSpanProcessor
Processor for joint entity and relation extraction using token-level NER.
Extends token-based NER processing to additionally handle relation extraction between entity pairs, supporting end-to-end joint training with BIO-style entity tagging.
- Inherits from:
UniEncoderTokenProcessor: Token-level BIO tagging for entities
RelationExtractionSpanProcessor: Relation extraction utilities
- __init__(config, tokenizer, words_splitter)
Initialize the relation extraction token processor.
- Parameters:
config: Configuration object.
tokenizer: Transformer tokenizer.
words_splitter: Word-level tokenizer/splitter.
- preprocess_example(tokens, ner, classes_to_id, relations=None, rel_classes_to_id=None)
Preprocess a single example for joint entity and relation extraction.
Processes both entity annotations (for token-level BIO tagging) and relation triplets, ensuring consistent indexing when entities are reordered.
- Parameters:
tokens: List of token strings.
ner: List of entity annotations as (start, end, label) tuples.
classes_to_id: Mapping from entity class labels to integer IDs.
relations: List of relation annotations as (head_idx, tail_idx, rel_type) tuples.
rel_classes_to_id: Mapping from relation class labels to integer IDs.
- Returns:
Dictionary containing:
tokens: Token strings
seq_length: Sequence length
entities: Original entity annotations
entities_id: Entity annotations with class IDs
relations: Original relation annotations
rel_idx: Tensor of relation head/tail entity indices
rel_label: Tensor of relation type labels
- Warns:
UserWarning: If the sequence length exceeds max_len; the sequence is truncated.
- create_batch_dict(batch, class_to_ids, id_to_classes, rel_class_to_ids=None, rel_id_to_classes=None)
Create a batch dictionary from preprocessed relation extraction examples.
- Parameters:
batch: List of preprocessed example dictionaries.
class_to_ids: List of entity class-to-ID mappings.
id_to_classes: List of entity ID-to-class mappings.
rel_class_to_ids: List of relation class-to-ID mappings.
rel_id_to_classes: List of relation ID-to-class mappings.
- Returns:
Dictionary containing all batch data for joint entity and relation extraction with token-level entity labels.
- tokenize_and_prepare_labels(batch, prepare_labels, *args, **kwargs)
Tokenize inputs and prepare labels for joint entity-relation extraction.
- Parameters:
batch: Batch dictionary with tokens, entities, relations, and class mappings.
prepare_labels: Whether to prepare labels.
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
- Returns:
Dictionary containing tokenized inputs, token-level entity labels, relation adjacency matrix, and relation labels.