Advanced Usage

🚀 Basic Use Case

After installing the GLiNER library, import the GLiNER class. Load your chosen model with GLiNER.from_pretrained, then call predict_entities to identify entities in your text.

from gliner import GLiNER

# Load a GLiNER model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Sample text for entity prediction
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; 
born 5 February 1985) is a Portuguese professional footballer who plays as a forward for 
and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely 
regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or 
awards, a record three UEFA Men's Player of the Year Awards, and four European Golden 
Shoes, the most by a European player.
"""

# Define labels for entity extraction
labels = ["person", "award", "date", "teams", "competition"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.5)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
Expected Output
Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award

Understanding the Output

Each predicted entity is a dictionary with the following structure:

{
    'start': int,      # Start character position in text
    'end': int,        # End character position in text
    'text': str,       # Extracted text span
    'label': str,      # Predicted entity type
    'score': float     # Confidence score (0-1)
}

Example:

for entity in entities:
    print(f"Text: {entity['text']}")
    print(f"Label: {entity['label']}")
    print(f"Score: {entity['score']:.3f}")
    print(f"Position: [{entity['start']}:{entity['end']}]")
    print("---")

Batch Processing

For processing multiple texts efficiently, use the inference method:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Multiple texts to process
texts = [
    "Apple Inc. was founded by Steve Jobs in Cupertino, California.",
    "Google LLC is headquartered in Mountain View.",
    "Amazon was started by Jeff Bezos in Seattle."
]

labels = ["organization", "person", "location"]

# Process all texts at once
all_entities = model.inference(texts, labels, batch_size=3, threshold=0.5)

# Display results for each text
for i, entities in enumerate(all_entities):
    print(f"\nText {i+1}: {texts[i]}")
    print("Entities:")
    for entity in entities:
        print(f"  - {entity['text']} ({entity['label']}): {entity['score']:.2f}")

Benefits of Batch Processing:

  • Faster: Process multiple texts in parallel

  • Efficient: Better GPU utilization

  • Scalable: Handle large document collections

Using Different Model Architectures

GLiNER supports multiple architecture variants, each optimized for different scenarios.

UniEncoder Models (Standard)

Best for general-purpose NER with up to ~30 entity types:

from gliner import GLiNER

# Load a standard UniEncoder model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

text = "Apple Inc. was founded by Steve Jobs in 1976."
labels = ["company", "person", "date"]

entities = model.predict_entities(text, labels)
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")

BiEncoder Models (Scalable)

Best for handling many entity types (50-200+) with pre-computed label embeddings:

from gliner import GLiNER

# Load a BiEncoder model
model = GLiNER.from_pretrained("knowledgator/gliner-bi-small-v1.0")

# BiEncoders handle many entity types efficiently
labels = [
    "person", "organization", "location", "date", "product", "event",
    "technology", "software", "hardware", "programming_language",
    "framework", "library", "database", "protocol", "standard",
    # ... can handle 100+ types efficiently
]

text = "Python is a programming language created by Guido van Rossum."
entities = model.predict_entities(text, labels)

# For production: pre-compute label embeddings
label_embeddings = model.encode_labels(labels, batch_size=16)

# Then use cached embeddings for faster inference
entities = model.predict_with_embeds(
    text, 
    label_embeddings, 
    labels,
    threshold=0.5
)

BiEncoder Advantages:

  • Handle 100+ entity types without performance degradation

  • Pre-compute label embeddings once, reuse across documents

  • Faster inference when processing many documents with same entity types

Token-Level Models

Best for extracting long entity spans (multi-sentence entities, summaries):

from gliner import GLiNER

# Load a token-level model
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

# Token-level models excel at long entities
text = """
The European Union is a political and economic union of 27 member states 
that are located primarily in Europe. The EU has developed an internal 
single market through a standardised system of laws.
"""

labels = ["organization", "number", "location", "concept"]

entities = model.predict_entities(text, labels)
for entity in entities:
    print(f"{entity['text'][:50]}... => {entity['label']}")

Relation Extraction Models

Extract both entities and relationships between them:

from gliner import GLiNER

# Load a relation extraction model
model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v0.5")

text = "Bill Gates founded Microsoft in 1975. The company is headquartered in Redmond."

# Define entity types and relation types
entity_labels = ["person", "organization", "date", "location"]
relation_labels = ["founded", "founded_in", "headquartered_in"]

# Extract entities and relations
entities, relations = model.inference(
    [text],
    labels=entity_labels,
    relations=relation_labels,
    threshold=0.5,
    relation_threshold=0.5
)

# Display entities
print("Entities:")
for entity in entities[0]:
    print(f"  {entity['text']} ({entity['label']})")

# Display relations
print("\nRelations:")
for relation in relations[0]:
    head = entities[0][relation['head']['entity_idx']]
    tail = entities[0][relation['tail']['entity_idx']]
    print(f"  {head['text']} --[{relation['relation']}]--> {tail['text']}")
Expected Output
Entities:
  Bill Gates (person)
  Microsoft (organization)
  1975 (date)
  Redmond (location)

Relations:
  Bill Gates --[founded]--> Microsoft
  Microsoft --[founded_in]--> 1975
  Microsoft --[headquartered_in]--> Redmond

Advanced Configuration

Adjusting the Threshold

Control the precision-recall tradeoff:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Apple Inc. is a technology company."
labels = ["company", "industry"]

# High threshold: Higher precision, lower recall
entities_high = model.predict_entities(text, labels, threshold=0.7)
print(f"High threshold (0.7): {len(entities_high)} entities")

# Low threshold: Lower precision, higher recall
entities_low = model.predict_entities(text, labels, threshold=0.3)
print(f"Low threshold (0.3): {len(entities_low)} entities")

# Default threshold
entities_default = model.predict_entities(text, labels)  # threshold=0.5
print(f"Default threshold (0.5): {len(entities_default)} entities")

Relation extraction models also accept two additional threshold parameters:

  • adjacency_threshold: Confidence threshold for adjacency matrix reconstruction (defaults to threshold).

  • relation_threshold: Confidence threshold for relations (defaults to threshold).

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Apple Inc. is a technology company founded in 1976."
labels = ["company", "industry", "date"]
relations = ["founded in"]

results = model.predict_entities(
    text,
    labels,
    relations=relations,
    threshold=0.3,
    adjacency_threshold=0.25,
    relation_threshold=0.7,
)

Use a lower adjacency threshold so the model can rerank and classify more pairs of entities that may be linked, and a higher relation threshold for more specificity and better precision. Adapt all three thresholds to your use case.

Flat vs Nested NER

Control whether entities can overlap:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "The University of California, Berkeley is located in California."
labels = ["university", "location"]

# Flat NER: No overlapping entities (default)
entities_flat = model.predict_entities(text, labels, flat_ner=True)
print("Flat NER:", [e['text'] for e in entities_flat])
# Output: ['University of California, Berkeley', 'California']

# Nested NER: Allow overlapping entities
entities_nested = model.predict_entities(text, labels, flat_ner=False)
print("Nested NER:", [e['text'] for e in entities_nested])
# Output: ['University of California, Berkeley', 'California, Berkeley', 'California']

Multi-label Classification

Allow entities to have multiple types:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Dr. Smith is a cardiologist at Mayo Clinic."
labels = ["person", "doctor", "specialist", "professional", "organization", "hospital"]

# Single label per entity (default)
entities_single = model.predict_entities(text, labels, multi_label=False)
print("Single label:")
for e in entities_single:
    print(f"  {e['text']}: {e['label']}")

# Multiple labels per entity
entities_multi = model.predict_entities(text, labels, multi_label=True)
print("\nMulti-label:")
for e in entities_multi:
    print(f"  {e['text']}: {e['label']}")

Local Models and Caching

Loading from Local Directory

from gliner import GLiNER

# Load from local directory
model = GLiNER.from_pretrained("/path/to/local/model")

# Or load from HuggingFace Hub with local cache
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    cache_dir="./model_cache"  # Cache models locally
)
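
To create a local copy yourself, download the model once, save it, and load it from disk afterwards (a minimal sketch; the directory name is arbitrary):

from gliner import GLiNER

# Download once from the Hub, then persist a local copy
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
model.save_pretrained("./local_gliner")

# Later: load entirely from the local directory
model = GLiNER.from_pretrained("./local_gliner")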

Device Selection

from gliner import GLiNER

# Load on GPU
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    map_location="cuda"  # Use GPU
)

# Load on CPU
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    map_location="cpu"
)

# Check device
print(f"Model is on: {model.device}")

Reduced-precision loading (dtype)

Pass dtype to from_pretrained to load the weights directly at the target floating-point precision, with no intermediate fp32 copy and no post-load cast:

from gliner import GLiNER
import torch

# Either a string or a torch.dtype
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", dtype="bf16", map_location="cuda")
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", dtype=torch.bfloat16, map_location="cuda")

Accepted values: "fp16" / "float16" / "half", "bf16" / "bfloat16", "fp32" / "float32" / "float", or any floating-point torch.dtype. Int/bool buffers are left untouched; non-floating dtypes (e.g. torch.int8) are rejected (use quantize="int8" for that path).

Why use dtype instead of quantize="bf16":

  • quantize casts after the full fp32 state dict + fp32 model are already in memory.

  • dtype casts each tensor as it is read from the safetensors file and pre-casts the model shell before load_state_dict, so a fully-fp32 snapshot never co-exists with the loaded weights. For CPU-only loads, peak host memory during load drops from ~2Γ— fp32 to ~1Γ— fp32 for bf16/fp16. For map_location="cuda", the state dict streams to GPU while the shell is CPU-side, so the saving is avoiding a simultaneous fp32 GPU state dict + fp32 GPU model β€” not quite a 2Γ—β†’1Γ— total-footprint reduction, but still a meaningful win on the GPU peak and on the separate post-load cast pass.

When it matters: cold starts and scalable serverless deployments (AWS Lambda, Cloud Run, Modal, RunPod serverless, autoscaled Kubernetes pods, etc.), where startup latency and peak memory directly drive cost and SLA:

  • Shorter cold-start on every new container (one pass instead of load + cast).

  • Lower peak memory lets instances fit on smaller memory tiers and reduces boot-time OOMs under memory pressure.

  • Faster first-inference latency after a scale-from-zero event.

dtype covers plain precision changes (bf16/fp16/fp32). For int8 / torchao / CPU dynamic quantization, keep using quantize (see below). The two can be combined if desired.

Skipping the random-init shell (low_cpu_mem_usage)

dtype= lowers peak memory but doesn't speed up the load itself: even with dtype="bf16", GLiNER still allocates an fp32 random-initialized model shell, runs Kaiming/Xavier init over every parameter, casts the whole thing to bf16, then overwrites every value with the loaded weights. All of that init work is thrown away.

Pass low_cpu_mem_usage=True to skip it: the model graph is built under torch.device("meta") (shape descriptors only, no allocation, no random init), the state dict is read at the target precision, and load_state_dict(assign=True) swaps the loaded tensors directly into the meta-shell parameter slots in one pass.
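
Conceptually this is the standard PyTorch meta-device pattern; a generic sketch for intuition (not GLiNER's internal code):

import torch
from torch import nn

# Build the module graph on the meta device: shapes only, no allocation, no random init
with torch.device("meta"):
    shell = nn.Linear(768, 768)

# A state dict already at the target precision (fabricated here; normally read from safetensors)
state_dict = {
    "weight": torch.zeros(768, 768, dtype=torch.bfloat16),
    "bias": torch.zeros(768, dtype=torch.bfloat16),
}

# assign=True swaps the loaded tensors directly into the meta-shell parameter slots
shell.load_state_dict(state_dict, assign=True)
print(shell.weight.device, shell.weight.dtype)  # cpu torch.bfloat16

In GLiNER this happens transparently when you pass the flag: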

model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    dtype="bf16",
    low_cpu_mem_usage=True,
    map_location="cuda",
)

Measured on gliner_medium-v2.1 on an RTX 5090 (n=12 reps, Welch t-tested, OS page cache warmed):

| path | mean load time | speedup | peak host RSS delta |
| --- | --- | --- | --- |
| baseline (cuda, bf16) | 3.16 s | 1.0× | 1361 MB |
| low_cpu_mem_usage=True (cuda, bf16) | 1.61 s | 1.96× | 1004 MB |
| baseline (cpu, bf16) | 3.30 s | 1.0× | 1597 MB |
| low_cpu_mem_usage=True (cpu, bf16) | 1.60 s | 2.06× | 1225 MB |
| baseline (cpu, fp32) | 3.04 s | 1.0× | 1598 MB |
| low_cpu_mem_usage=True (cpu, fp32) | 1.45 s | 2.10× | 170 MB |

About 1.5 seconds saved on every cold start, plus 23–89% lower peak host RSS depending on dtype (the fp32 case is dramatic because safetensors mmaps the on-disk file and we never copy it into anonymous memory). Loaded parameters are bit-identical to the standard path, verified across 224 parameters and 1 buffer (position_ids, re-materialized after assign).

The default is False while the path matures; enable it explicitly when cold-start latency or peak host memory matters. low_cpu_mem_usage stacks with dtype= (use them together) and is independent of quantize= and compile_torch_model=, as the combined example below shows.
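
A combined call might look like this (a sketch; pick the dtype and device that match your deployment):

from gliner import GLiNER

# Half-precision load, no throwaway init shell, compiled graph (Linux/WSL only)
model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    dtype="bf16",
    low_cpu_mem_usage=True,
    compile_torch_model=True,
    map_location="cuda",
)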

Selective download (variant)

dtype= casts in memory but the on-disk file is still fp32, so the bytes pulled from the Hub don't shrink. If a publisher uploads a half-precision variant of the file (model.fp16.safetensors or model.bf16.safetensors, following the transformers naming convention), pass variant= to download only that file:

model = GLiNER.from_pretrained("org/gliner_bf16-v1", variant="bf16")
# Halves bytes-on-the-wire vs. the default fp32 download (~745 MB -> ~370 MB
# for gliner_medium-v2.1) when a bf16 file is published.

Behavior (variant= is a best-effort hint, not a hard requirement):

  • variant=None (default): unchanged β€” pulls the whole repo and loads model.safetensors.

  • variant="fp16" / "bf16" and the variant is published: snapshot_download is filtered with allow_patterns so only model.{variant}.safetensors (plus configs and tokenizer assets) is fetched. dtype= is inferred from variant; passing both with mismatched precisions raises.

  • variant="fp16" / "bf16" and the variant is not published: a UserWarning is emitted and the loader falls back to the default fp32 file plus an in-memory cast β€” same outcome as passing dtype= alone, no error, no I/O win. The warning text tells the user the publisher hasn’t uploaded the file so the bandwidth savings didn’t apply.

This is the lever to pull for cold-start cost when bytes-on-the-wire dominate. Set variant="bf16" and forget about it: if the publisher has the variant file you get the I/O savings, and if they don't you get the in-memory dtype= behavior with a one-line warning. The probe uses huggingface_hub.HfApi().list_repo_files (one cheap API call) before downloading.
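
To run that probe yourself before settling on a variant, the same check is a single Hub API call (a minimal sketch using huggingface_hub directly):

from huggingface_hub import HfApi

repo_id = "urchade/gliner_medium-v2.1"
files = HfApi().list_repo_files(repo_id)

# The transformers-style filename the loader looks for
print("bf16 variant published:", "model.bf16.safetensors" in files)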

Quantization, Compilation & FlashDeBERTa

Combine dtype="fp16" (or "bf16") with compile_torch_model=True for up to ~1.9x faster GPU inference with zero quality loss:

from gliner import GLiNER

model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    map_location="cuda",
    dtype="fp16",             # or "bf16" β€” see "Reduced-precision loading" above
    compile_torch_model=True,
)

Or apply after loading:

import torch
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", map_location="cuda")
model.to(torch.float16)  # fp16 half-precision
model.compile()          # torch.compile with dynamic shapes

Compilation is especially beneficial for short sequences, where the overhead of the standard eager execution is proportionally larger. For longer sequences, FlashDeBERTa is recommended as it scales much better with sequence length.

Benchmarks (CoNLL-2003 strict F1, gliner_medium-v2.1, RTX 5090):

| Condition | F1 | Speedup |
| --- | --- | --- |
| GPU fp32 (baseline) | 0.8107 | 1.00x |
| + dtype="fp16" | 0.8107 | 1.35x |
| + compile | 0.8107 | 1.31x |
| + dtype="fp16" + compile | 0.8107 | 1.94x |

quantize= vs dtype=:

  • dtype="fp16" / "bf16" β€” plain precision change via efficient load (see the dedicated section above). This is the only way to get half-precision inference.

  • quantize="int8" β€” real int8 quantization. On CPU, built-in FBGEMM kernels (~1.6x speedup). On GPU, torchao int8 weight-only quantization (~50% memory reduction, no speed gain). Intended for models fine-tuned with quantization-aware training (QAT); stock DeBERTa-based models lose accuracy with int8.

  • quantize= accepts only "int8" (or None). Passing True, "fp16", or "bf16" raises with a migration message β€” those were precision downcasts, not quantization, and are handled exclusively by dtype= / model.to(...) now.

Compilation notes:

  • compile_torch_model=True uses torch.compile which JIT-compiles the model via Triton kernels. The first inference call will be slower due to compilation, but all subsequent calls benefit from the compiled graph. This is only available on Linux and WSL (not native Windows or macOS).

⚡ Accelerating Inference with Sequence Packing

Sequence packing allows GLiNER to combine multiple short requests into a single transformer pass while keeping a block-diagonal attention mask. This drastically reduces the number of padding tokens the encoder needs to process and yields higher throughput.

  1. Configure packing once for all predictions

    from gliner import GLiNER, InferencePackingConfig
    
    model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", map_location="cuda")
    
    packing_cfg = InferencePackingConfig(
        max_length=512,
        sep_token_id=model.data_processor.transformer_tokenizer.sep_token_id,
        streams_per_batch=1,
    )
    
    # Enable packing for every subsequent `run`/`predict_*` call.
    model.configure_inference_packing(packing_cfg)
    
    texts = ["Email CEO to approve budget", "Schedule yearly medical checkup"]
    labels = ["person", "organization", "action"]
    
    predictions = model.inference(texts, labels, batch_size=16)
    

    You can override or disable the default configuration on a per-call basis by passing packing_config=<new_cfg> or packing_config=None respectively when invoking model.inference or model.predict_entities (see the sketch after this list).

  2. Benchmark the impact

    The bench/bench_gliner_e2e.py script can stress the full GLiNER pipeline in addition to encoder-only Hugging Face models:

    python bench/bench_gliner_e2e.py
    

    To isolate and measure the impact on the encoder:

    python bench/bench_infer_packing.py --batch_size 32 --scenario short_zipf
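
To override or disable the packing configured in step 1 for a single call, pass packing_config explicitly. A short sketch reusing the model, texts, and labels defined in step 1; the larger max_length is an arbitrary example value:

# One-off call with a different packing configuration
wide_cfg = InferencePackingConfig(
    max_length=1024,
    sep_token_id=model.data_processor.transformer_tokenizer.sep_token_id,
    streams_per_batch=1,
)
predictions = model.inference(texts, labels, batch_size=16, packing_config=wide_cfg)

# Disable packing entirely for this call
predictions = model.inference(texts, labels, batch_size=16, packing_config=None)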
    

🔌 Usage with spaCy

GLiNER can be seamlessly integrated with spaCy. To begin, install the gliner-spacy library via pip:

pip install gliner-spacy

Following installation, you can add GLiNER to a spaCy NLP pipeline. Here's how to integrate it with a blank English pipeline; however, it's compatible with any spaCy model.

import spacy
from gliner_spacy.pipeline import GlinerSpacy

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": "urchade/gliner_mediumv2.1",
    "chunk_size": 250,
    "labels": ["person", "organization", "email"],
    "style": "ent",
    "threshold": 0.3,
    "map_location": "cpu" # only available in v.0.0.7
}

# Initialize a blank English spaCy pipeline and add GLiNER
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example text for entity detection
text = "This is a text about Bill Gates and Microsoft."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.score) # ent._.score only available in v. 0.0.7

Expected Output

Bill Gates => person
Microsoft => organization

πŸƒβ€β™€οΈ Using FlashDeBERTaΒΆ

Most GLiNER models use the DeBERTa encoder as their backbone. This architecture offers strong token classification performance and typically requires less data to achieve good results. However, a major drawback has been its slower inference speed, and until recently, there was no flash attention implementation compatible with DeBERTa's disentangled attention mechanism.

To address this, FlashDeBERTa was introduced.

Installation

pip install flashdeberta -U

Before using FlashDeBERTa, please make sure that you have transformers>=4.51.3.

Usage

To enable FlashDeBERTa, set the USE_FLASHDEBERTA environment variable before loading the model:

export USE_FLASHDEBERTA=1

Or set it directly in Python:

import os
os.environ["USE_FLASHDEBERTA"] = "1"

from gliner import GLiNER

# FlashDeBERTa will be used when USE_FLASHDEBERTA is set and the package is installed
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

# To explicitly use eager attention instead
model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    _attn_implementation="eager"
)

Performance Boost: FlashDeBERTa provides up to a 3× speed boost for typical sequence lengths, and even greater improvements for longer sequences.

πŸ› οΈ High-Level Pipelines {#pipelines}ΒΆ

GLiNER-Multitask models are designed to extract relevant information from plain text based on user-provided custom prompts. These encoder-based multitask models enable efficient and controllable information extraction with a single model, reducing computational and storage costs.

Supported Tasks:

  • Named Entity Recognition (NER): Identify and categorize entities

  • Relation Extraction: Detect relationships between entities

  • Summarization: Extract key sentences

  • Sentiment Extraction: Identify sentiment-bearing text spans

  • Key-Phrase Extraction: Extract important phrases and keywords

  • Question-Answering: Find answers to questions in text

  • Open Information Extraction: Extract information based on open prompts

  • Text Classification: Classify text against predefined labels

Classification

The GLiNERClassifier pipeline performs text classification tasks:

from gliner import GLiNER
from gliner.multitask import GLiNERClassifier

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
classifier = GLiNERClassifier(model=model)

# Single-label classification
text = "SpaceX successfully launched a new rocket into orbit."
labels = ['science', 'technology', 'business', 'sports']

predictions = classifier(text, classes=labels, multi_label=False)
print(predictions)
# Output: [[{'label': 'technology', 'score': 0.84}]]

# Multi-label classification
predictions_multi = classifier(text, classes=labels, multi_label=True)
print(predictions_multi)
# Output: [[{'label': 'technology', 'score': 0.84}, {'label': 'science', 'score': 0.72}]]

Evaluation on Dataset:

# Evaluate on HuggingFace dataset
metrics = classifier.evaluate('dair-ai/emotion')
print(metrics)
# Output: {'micro': 0.4465, 'macro': 0.4243, 'weighted': 0.4884}

Question-Answering

The GLiNERQuestionAnswerer pipeline extracts answers from text:

from gliner import GLiNER
from gliner.multitask import GLiNERQuestionAnswerer

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
answerer = GLiNERQuestionAnswerer(model=model)

# Extract answer
text = "SpaceX was founded by Elon Musk in 2002 to reduce space transportation costs."
question = "Who founded SpaceX?"

predictions = answerer(text, questions=question)
print(predictions)
# Output: [[{'answer': 'Elon Musk', 'score': 0.998}]]

# Multiple questions
questions = ["Who founded SpaceX?", "When was SpaceX founded?", "What is SpaceX's goal?"]
predictions = answerer(text, questions=questions)
for q, pred in zip(questions, predictions):
    print(f"Q: {q}")
    print(f"A: {pred[0]['answer']} (score: {pred[0]['score']:.3f})")

Evaluation on SQuAD:

from gliner.multitask import GLiNERSquadEvaluator

evaluator = GLiNERSquadEvaluator(model_id="knowledgator/gliner-multitask-large-v0.5")
metrics = evaluator.evaluate(threshold=0.25)
print(metrics)
# Output: {'exact': 29.41, 'f1': 29.80, 'total': 11873, ...}

Relation Extraction

The GLiNERRelationExtractor pipeline extracts relationships between entities:

from gliner import GLiNER
from gliner.multitask import GLiNERRelationExtractor

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
relation_extractor = GLiNERRelationExtractor(model=model)

# Extract relations
text = "Elon Musk founded SpaceX in 2002 to reduce space transportation costs."
entities = ['person', 'company', 'year', 'goal']
relations = ['founded', 'founded_in', 'goal']

predictions = relation_extractor(
    text, 
    entities=entities, 
    relations=relations,
    threshold=0.5
)

for pred in predictions[0]:
    print(f"{pred['source']} --[{pred['relation']}]--> {pred['target']}")
    print(f"  Score: {pred['score']:.3f}")
Expected Output
Elon Musk --[founded]--> SpaceX
  Score: 0.958
SpaceX --[founded_in]--> 2002
  Score: 0.912

Open Information Extraction

The GLiNEROpenExtractor pipeline extracts information based on custom prompts:

from gliner import GLiNER
from gliner.multitask import GLiNEROpenExtractor

# Initialize with custom prompt
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract all companies related to space technologies"
)

# Extract information
text = """
Elon Musk founded SpaceX in 2002 to reduce space transportation costs. 
Also Elon is founder of Tesla, NeuroLink and many other companies.
"""

labels = ['company']
predictions = extractor(text, labels=labels, threshold=0.5)

for pred in predictions[0]:
    print(f"{pred['text']} (score: {pred['score']:.3f})")
Expected Output
SpaceX (score: 0.962)
Tesla (score: 0.936)
NeuroLink (score: 0.912)

Custom Prompts for Different Tasks:

# Extract product descriptions
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract product descriptions and features from the text"
)

# Extract technical specifications
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract technical specifications and requirements"
)

# Extract contact information
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract all contact information including emails and phone numbers"
)

Summarization

The GLiNERSummarizer pipeline extracts key sentences for summarization:

from gliner import GLiNER
from gliner.multitask import GLiNERSummarizer

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
summarizer = GLiNERSummarizer(model=model)

# Extract summary
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop 
and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, 
Gates held the positions of chairman, chief executive officer, president and chief 
software architect, while also being the largest individual shareholder until May 2014.
"""

summary = summarizer(text, threshold=0.1)
print(summary)
Expected Output
['Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop 
and sell BASIC interpreters for the Altair 8800.']

Controlling Summary Length:

# More selective (higher threshold = shorter summary)
summary_short = summarizer(text, threshold=0.5)

# More comprehensive (lower threshold = longer summary)
summary_long = summarizer(text, threshold=0.05)

Advanced Relation Extraction with UTCA

For more nuanced control over relation extraction, use the utca framework:

Installation

pip install utca -U

Setting Up the Pipeline

from utca.core import RenameAttribute
from utca.implementation.predictors import GLiNERPredictor, GLiNERPredictorConfig
from utca.implementation.tasks import (
    GLiNER,
    GLiNERPreprocessor,
    GLiNERRelationExtraction,
    GLiNERRelationExtractionPreprocessor,
)

# Initialize predictor
predictor = GLiNERPredictor(
    GLiNERPredictorConfig(
        model_name="knowledgator/gliner-multitask-large-v0.5",
        device="cuda:0",  # Use "cpu" for CPU inference
    )
)

# Create pipeline
pipe = (
    GLiNER(  # Extract entities
        predictor=predictor,
        preprocess=GLiNERPreprocessor(threshold=0.7)
    )
    | RenameAttribute("output", "entities")  # Prepare for relation extraction
    | GLiNERRelationExtraction(  # Extract relations
        predictor=predictor,
        preprocess=(
            GLiNERPreprocessor(threshold=0.5)
            | GLiNERRelationExtractionPreprocessor()
        )
    )
)

Running the Pipeline

text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop 
and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, 
Gates held the positions of chairman, chief executive officer, president and chief 
software architect, while also being the largest individual shareholder until May 2014.
"""

result = pipe.run({
    "text": text,
    "labels": ["organization", "person", "position", "date"],
    "relations": [
        {
            "relation": "founder",
            "pairs_filter": [("organization", "person")],  # Only consider org-person pairs
            "distance_threshold": 100,  # Max distance between entities (in characters)
        },
        {
            "relation": "inception_date",
            "pairs_filter": [("organization", "date")],
        },
        {
            "relation": "held_position",
            "pairs_filter": [("person", "position")],
        }
    ]
})

# Display results
for relation in result["output"]:
    source = relation['source']['span']
    target = relation['target']['span']
    rel_type = relation['relation']
    score = relation['score']
    print(f"{source} --[{rel_type}]--> {target} (score: {score:.3f})")
Expected Output
Microsoft --[founder]--> Bill Gates (score: 0.968)
Microsoft --[founder]--> Paul Allen (score: 0.863)
Microsoft --[inception_date]--> April 4, 1975 (score: 0.997)
Bill Gates --[held_position]--> chairman (score: 0.966)
Bill Gates --[held_position]--> chief executive officer (score: 0.947)
Bill Gates --[held_position]--> president (score: 0.973)
Bill Gates --[held_position]--> chief software architect (score: 0.950)

Advanced UTCA Features

Distance Filtering:

# Only extract relations where entities are close together
relations = [
    {
        "relation": "works_for",
        "pairs_filter": [("person", "organization")],
        "distance_threshold": 50,  # Entities must be within 50 characters
    }
]

Multiple Relation Types:

# Define complex relation schemas
relations = [
    {
        "relation": "employed_by",
        "pairs_filter": [("person", "organization")],
    },
    {
        "relation": "located_in",
        "pairs_filter": [("organization", "location"), ("person", "location")],
    },
    {
        "relation": "acquired_by",
        "pairs_filter": [("organization", "organization")],
    },
]

Practical Examples

Example 1: Extract Company Information

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

text = """
Apple Inc. is headquartered in Cupertino, California. The company was founded 
by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. Tim Cook is the 
current CEO. Apple's main products include iPhone, iPad, and Mac computers.
"""

labels = ["company", "location", "person", "position", "product", "date"]
entities = model.predict_entities(text, labels, threshold=0.5)

# Organize by type
from collections import defaultdict
by_type = defaultdict(list)
for entity in entities:
    by_type[entity['label']].append(entity['text'])

for label, items in by_type.items():
    print(f"{label}: {', '.join(set(items))}")

Example 2: Process Scientific Papers

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

abstract = """
We introduce GPT-4, a large-scale multimodal model developed by OpenAI. 
The model was trained on a diverse dataset and exhibits strong performance 
on various benchmarks including MMLU, HumanEval, and GSM-8K.
"""

labels = [
    "model_name", "organization", "dataset", "benchmark", 
    "metric", "task", "method"
]

entities = model.predict_entities(abstract, labels, threshold=0.4)

print("Extracted Information:")
for entity in entities:
    print(f"  {entity['label']}: {entity['text']}")

Example 3: Analyze News Articles

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-bi-small-v1.0")

article = """
Tesla CEO Elon Musk announced on Twitter that the company will open a new 
Gigafactory in Austin, Texas. The facility will produce the Cybertruck and 
Model Y vehicles. Construction began in July 2020 and operations started in 2021.
"""

labels = [
    "person", "position", "company", "location", "facility", 
    "product", "date", "event"
]

# Process with BiEncoder for efficiency
entities = model.predict_entities(article, labels, threshold=0.5)

# Group related entities
print("Key Information:")
print(f"- Company: {[e['text'] for e in entities if e['label'] == 'company']}")
print(f"- Location: {[e['text'] for e in entities if e['label'] == 'location']}")
print(f"- Products: {[e['text'] for e in entities if e['label'] == 'product']}")
print(f"- Timeline: {[e['text'] for e in entities if e['label'] == 'date']}")

⚡ Prompt Compression (Precomputed Prompt Embeddings)

For uni-encoder models (span, token, and relation-extraction variants) you can precompute the prompt embeddings for a fixed label set and reuse them at inference time. In precomputed mode the encoder receives only the text (no <<ENT>>label1<<ENT>>...<<SEP>> prefix), which shortens the input sequence, reduces attention cost, and can noticeably speed up inference, at a small accuracy trade-off versus re-encoding the prompts on every call.

How it works

BaseGLiNER.compress_prompt_embeddings(texts, labels, rel_labels=None, batch_size=8, distill=False, distill_threshold=0.3, distill_epochs=3, distill_lr=1e-5, distill_batch_size=None, distill_output_dir="./distill_ckpt", distill_train_kwargs=None):

  1. Runs the normal forward pass over (texts, labels) pairs.

  2. Extracts the per-label prompt embedding (the <<ENT>> token representation, pre-projection) from each example.

  3. Averages across all examples to produce an (L, D) matrix stored as a non-trainable parameter on the underlying model (model.precomputed_prompts).

  4. Sets config.precomputed_prompts_mode = True and writes config.id_to_classes, so subsequent predict_entities / forward calls skip prompt-prepending and look up the stored embeddings instead.

The stored embeddings travel with state_dict, so save_pretrained / from_pretrained round-trip them automatically. Training can continue after compression; the stored matrix is frozen but everything else keeps training.

Basic usage (entity extraction)

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Representative texts from your target domain. They do not need labels;
# they are only used as contexts while averaging the prompt representations.
calibration_texts = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Apple announced a new iPhone at their Cupertino headquarters.",
    # ... ideally 100–1000 diverse sentences from your domain
]

labels = ["person", "organization", "location", "date"]

# One-time compression step
model.compress_prompt_embeddings(calibration_texts, labels, batch_size=16)

# Inference now uses the precomputed prompts; the label prompt is not re-encoded
entities = model.predict_entities(
    "Tim Cook visited Berlin last Tuesday.",
    labels,               # must match (order-insensitive) the compressed set
    threshold=0.5,
)

# Persist the compressed model
model.save_pretrained("./gliner-compressed")
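
# Reloading later restores the compressed prompt embeddings from the state dict
model = GLiNER.from_pretrained("./gliner-compressed")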

Relation extraction

For relex models (UniEncoderSpanRelexModel / UniEncoderTokenRelexModel), pass rel_labels so the <<REL>> prompt embeddings are compressed as well:

model.compress_prompt_embeddings(
    texts=calibration_texts,
    labels=["person", "organization", "location"],
    rel_labels=["works_for", "located_in", "founder_of"],
    batch_size=8,
)

End-to-end distillation

Compression alone can dip quality because averaged prompt embeddings drop context-specific signal. Pass distill=True to recover it in a single call: the raw (pre-compression) model first generates pseudo-labels over texts, prompts are then compressed, and the compressed model is fine-tuned on those pseudo-labels, with no separate script required.

model.compress_prompt_embeddings(
    texts=calibration_texts,     # also used as the distillation corpus
    labels=labels,
    batch_size=16,
    distill=True,
    distill_threshold=0.3,       # pseudo-label confidence cutoff
    distill_epochs=3,
    distill_lr=1e-5,
    distill_output_dir="./distill_ckpt",
)

Relevant knobs:

  • distill_threshold: confidence cutoff used when the raw model produces pseudo-labels. Lower values widen the training signal but add noise.

  • distill_epochs, distill_lr: fine-tuning schedule.

  • distill_batch_size: defaults to batch_size if omitted.

  • distill_output_dir: forwarded to train_model.

  • distill_train_kwargs: dict of extra kwargs merged into the underlying train_model call (e.g. to override save_strategy, logging_steps, etc.).

Pseudo-labels are generated from the same texts used for compression, so one diverse in-domain corpus serves both roles.
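
As an illustration of the last two knobs, the fine-tuning schedule can be adjusted without leaving the single call; a sketch where the overridden training kwargs are example values only:

model.compress_prompt_embeddings(
    texts=calibration_texts,
    labels=labels,
    batch_size=16,
    distill=True,
    distill_batch_size=8,                 # defaults to batch_size when omitted
    distill_train_kwargs={                # merged into the underlying train_model call
        "logging_steps": 50,
        "save_strategy": "no",
    },
)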

Tips and Best Practices

  1. Choose the right model architecture:

    • UniEncoder: General purpose, < 30 entity types

    • BiEncoder: Many entity types (50-200+)

    • Token-level: Long entity spans

    • Relation extraction: Knowledge graph construction

  2. Optimize threshold for your use case:

    • High precision: threshold = 0.6-0.8

    • Balanced: threshold = 0.4-0.6

    • High recall: threshold = 0.2-0.4

  3. Use batch processing for multiple documents:

    • More efficient GPU utilization

    • Faster overall processing

  4. Pre-compute label embeddings (BiEncoder):

    • Cache embeddings when processing many documents

    • Significant speedup for production use

  5. Enable FlashDeBERTa:

    • ~3x speed improvement

    • No accuracy loss

  6. Use appropriate labels:

    • Specific labels work better than generic ones

    • β€œcompany” > β€œentity”

    • β€œmedication” > β€œword”

Troubleshooting

Low Accuracy

# Try lowering the threshold
entities = model.predict_entities(text, labels, threshold=0.3)

# Use more specific labels
labels = ["tech_company", "software_product", "founder"]  # Specific
# instead of
labels = ["organization", "thing", "person"]  # Too generic

# Try a larger model
model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")

Slow Inference

# Enable FlashDeBERTa
# pip install flashdeberta

# Compile model
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    compile_torch_model=True
)

# Use batch processing
entities_batch = model.inference(texts, labels, batch_size=16)

# For BiEncoder: pre-compute embeddings
label_embeds = model.encode_labels(labels)
entities = model.predict_with_embeds(text, label_embeds, labels)

Out of Memory

# Reduce batch size
entities = model.inference(texts, labels, batch_size=4)

# Use a smaller model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Process on CPU
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    map_location="cpu"
)