Advanced Usage

🚀 Basic Use Case

After installing the GLiNER library, import the GLiNER class. Load your chosen model with GLiNER.from_pretrained, then call predict_entities to identify entities in your text.

from gliner import GLiNER

# Load a GLiNER model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Sample text for entity prediction
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; 
born 5 February 1985) is a Portuguese professional footballer who plays as a forward for 
and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely 
regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or 
awards, a record three UEFA Men's Player of the Year Awards, and four European Golden 
Shoes, the most by a European player.
"""

# Define labels for entity extraction
labels = ["person", "award", "date", "teams", "competition"]

# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.5)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])

Expected Output

Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award

Understanding the Output

Each predicted entity is a dictionary with the following structure:

{
    'start': int,      # Start character position in text
    'end': int,        # End character position in text
    'text': str,       # Extracted text span
    'label': str,      # Predicted entity type
    'score': float     # Confidence score (0-1)
}

Example:

for entity in entities:
    print(f"Text: {entity['text']}")
    print(f"Label: {entity['label']}")
    print(f"Score: {entity['score']:.3f}")
    print(f"Position: [{entity['start']}:{entity['end']}]")
    print("---")
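The start and end fields can be used to slice the original text directly. The sketch below runs on hand-written sample dictionaries in the documented format (not real model output; the scores are made up), and group_by_label is a hypothetical helper name. It treats end as exclusive, as the [start:end] slice notation above suggests.

```python
from collections import defaultdict

text = "Apple Inc. was founded by Steve Jobs in Cupertino."

# Hand-written sample entities in the documented format (scores are illustrative).
sample_entities = [
    {"start": 0, "end": 10, "text": "Apple Inc.", "label": "organization", "score": 0.95},
    {"start": 26, "end": 36, "text": "Steve Jobs", "label": "person", "score": 0.98},
    {"start": 40, "end": 49, "text": "Cupertino", "label": "location", "score": 0.91},
]


def group_by_label(source_text, entities):
    """Map each label to the spans sliced out of the source text."""
    grouped = defaultdict(list)
    for ent in entities:
        span = source_text[ent["start"]:ent["end"]]
        # If the offsets are consistent, the slice reproduces the reported text.
        if span != ent["text"]:
            raise ValueError(f"offset mismatch: {span!r} vs {ent['text']!r}")
        grouped[ent["label"]].append(span)
    return dict(grouped)


grouped = group_by_label(text, sample_entities)
print(grouped)
```

Slicing by offsets rather than searching for the entity string avoids ambiguity when the same surface form appears more than once in a document.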

Batch Processing

For processing multiple texts efficiently, use the inference method:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Multiple texts to process
texts = [
    "Apple Inc. was founded by Steve Jobs in Cupertino, California.",
    "Google LLC is headquartered in Mountain View.",
    "Amazon was started by Jeff Bezos in Seattle."
]

labels = ["organization", "person", "location"]

# Process all texts at once
all_entities = model.inference(texts, labels, batch_size=3, threshold=0.5)

# Display results for each text
for i, entities in enumerate(all_entities):
    print(f"\nText {i+1}: {texts[i]}")
    print("Entities:")
    for entity in entities:
        print(f"  - {entity['text']} ({entity['label']}): {entity['score']:.2f}")

Benefits of Batch Processing:

  • Faster: Process multiple texts in parallel

  • Efficient: Better GPU utilization

  • Scalable: Handle large document collections
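For collections too large to hold in one list, the texts can be fed to the batch method in fixed-size chunks. The helper below is a hypothetical, library-independent sketch; the model call appears only as a comment and reuses the inference signature from the example above.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


documents = [f"Document number {i}" for i in range(10)]
chunks = list(chunked(documents, 4))
print([len(chunk) for chunk in chunks])

# With a loaded model, each chunk could be processed and its results stored
# before moving on, which keeps only one chunk of predictions in memory:
# for chunk in chunked(documents, 4):
#     results = model.inference(chunk, labels, batch_size=4, threshold=0.5)
```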

Using Different Model Architectures

GLiNER supports multiple architecture variants, each optimized for different scenarios.

UniEncoder Models (Standard)

Best for general-purpose NER with up to ~30 entity types:

from gliner import GLiNER

# Load a standard UniEncoder model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

text = "Apple Inc. was founded by Steve Jobs in 1976."
labels = ["company", "person", "date"]

entities = model.predict_entities(text, labels)
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")

BiEncoder Models (Scalable)

Best for handling many entity types (50-200+) with pre-computed label embeddings:

from gliner import GLiNER

# Load a BiEncoder model
model = GLiNER.from_pretrained("knowledgator/gliner-bi-small-v1.0")

# BiEncoders handle many entity types efficiently
labels = [
    "person", "organization", "location", "date", "product", "event",
    "technology", "software", "hardware", "programming_language",
    "framework", "library", "database", "protocol", "standard",
    # ... can handle 100+ types efficiently
]

text = "Python is a programming language created by Guido van Rossum."
entities = model.predict_entities(text, labels)

# For production: pre-compute label embeddings
label_embeddings = model.encode_labels(labels, batch_size=16)

# Then use cached embeddings for faster inference
entities = model.predict_with_embeds(
    text, 
    label_embeddings, 
    labels,
    threshold=0.5
)

BiEncoder Advantages:

  • Handle 100+ entity types without performance degradation

  • Pre-compute label embeddings once, reuse across documents

  • Faster inference when processing many documents with same entity types

Token-Level Models

Best for extracting long entity spans (multi-sentence entities, summaries):

from gliner import GLiNER

# Load a token-level model
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

# Token-level models excel at long entities
text = """
The European Union is a political and economic union of 27 member states 
that are located primarily in Europe. The EU has developed an internal 
single market through a standardised system of laws.
"""

labels = ["organization", "number", "location", "concept"]

entities = model.predict_entities(text, labels)
for entity in entities:
    print(f"{entity['text'][:50]}... => {entity['label']}")

Relation Extraction Models

Extract both entities and relationships between them:

from gliner import GLiNER

# Load a relation extraction model
model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v0.5")

text = "Bill Gates founded Microsoft in 1975. The company is headquartered in Redmond."

# Define entity types and relation types
entity_labels = ["person", "organization", "date", "location"]
relation_labels = ["founded", "founded_in", "headquartered_in"]

# Extract entities and relations
entities, relations = model.inference(
    [text],
    labels=entity_labels,
    relations=relation_labels,
    threshold=0.5,
    relation_threshold=0.5
)

# Display entities
print("Entities:")
for entity in entities[0]:
    print(f"  {entity['text']} ({entity['label']})")

# Display relations
print("\nRelations:")
for relation in relations[0]:
    head = entities[0][relation['head']['entity_idx']]
    tail = entities[0][relation['tail']['entity_idx']]
    print(f"  {head['text']} --[{relation['relation']}]--> {tail['text']}")

Expected Output

Entities:
  Bill Gates (person)
  Microsoft (organization)
  1975 (date)
  Redmond (location)

Relations:
  Bill Gates --[founded]--> Microsoft
  Microsoft --[founded_in]--> 1975
  Microsoft --[headquartered_in]--> Redmond
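The index-based relation records can be flattened into plain triples for downstream use. This sketch runs on hand-written data that mirrors the head/tail entity_idx layout used in the display loop above; to_triples and to_adjacency are hypothetical helper names, and real output may carry additional fields.

```python
from collections import defaultdict

# Hand-written stand-ins for entities[0] and relations[0] from the example above.
entities = [
    {"text": "Bill Gates", "label": "person"},
    {"text": "Microsoft", "label": "organization"},
    {"text": "1975", "label": "date"},
    {"text": "Redmond", "label": "location"},
]
relations = [
    {"head": {"entity_idx": 0}, "tail": {"entity_idx": 1}, "relation": "founded"},
    {"head": {"entity_idx": 1}, "tail": {"entity_idx": 2}, "relation": "founded_in"},
    {"head": {"entity_idx": 1}, "tail": {"entity_idx": 3}, "relation": "headquartered_in"},
]


def to_triples(ents, rels):
    """Resolve entity indices into (head_text, relation, tail_text) triples."""
    triples = []
    for rel in rels:
        head = ents[rel["head"]["entity_idx"]]["text"]
        tail = ents[rel["tail"]["entity_idx"]]["text"]
        triples.append((head, rel["relation"], tail))
    return triples


def to_adjacency(triples):
    """Group outgoing edges by head entity."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


triples = to_triples(entities, relations)
graph = to_adjacency(triples)
print(graph)
```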

Advanced Configuration

Adjusting the Threshold

Control the precision-recall tradeoff:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Apple Inc. is a technology company."
labels = ["company", "industry"]

# High threshold: Higher precision, lower recall
entities_high = model.predict_entities(text, labels, threshold=0.7)
print(f"High threshold (0.7): {len(entities_high)} entities")

# Low threshold: Lower precision, higher recall
entities_low = model.predict_entities(text, labels, threshold=0.3)
print(f"Low threshold (0.3): {len(entities_low)} entities")

# Default threshold
entities_default = model.predict_entities(text, labels)  # threshold=0.5
print(f"Default threshold (0.5): {len(entities_default)} entities")
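Because every entity carries a score, one run at a low threshold can be filtered afterwards to preview stricter settings without calling the model again. The sketch below uses made-up scores and a hypothetical helper, filter_by_threshold. It assumes post-hoc filtering matches what a higher threshold would return, which is worth confirming for your decoding settings (for example flat_ner).

```python
def filter_by_threshold(entities, threshold):
    """Keep entities whose score is strictly above the threshold."""
    return [ent for ent in entities if ent["score"] > threshold]


# Hand-written stand-in for the result of a single run at threshold=0.3.
low_threshold_entities = [
    {"text": "Apple Inc.", "label": "company", "score": 0.93},
    {"text": "technology", "label": "industry", "score": 0.41},
]

counts = {}
for threshold in (0.3, 0.5, 0.7):
    kept = filter_by_threshold(low_threshold_entities, threshold)
    counts[threshold] = len(kept)
    print(f"threshold={threshold}: {[ent['text'] for ent in kept]}")
```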

Relation extraction models also have two additional threshold parameters:

  • adjacency_threshold: Confidence threshold for adjacency matrix reconstruction (defaults to threshold).

  • relation_threshold: Confidence threshold for relations (defaults to threshold).

from gliner import GLiNER

# Relation thresholds apply to a relation extraction model
model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v0.5")
text = "Apple Inc. is a technology company founded in 1976."
labels = ["company", "industry", "date"]
relations = ["founded in"]

entities, extracted_relations = model.inference(
    [text],
    labels=labels,
    relations=relations,
    threshold=0.3,
    adjacency_threshold=0.25,
    relation_threshold=0.7,
)

Use a lower adjacency threshold so the model can rerank and classify more candidate entity pairs that may be linked. Set a higher relation threshold for more specificity and better precision. Adapt all three thresholds to your use case.

Flat vs Nested NER

Control whether entities can overlap:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "The University of California, Berkeley is located in California."
labels = ["university", "location"]

# Flat NER: No overlapping entities (default)
entities_flat = model.predict_entities(text, labels, flat_ner=True)
print("Flat NER:", [e['text'] for e in entities_flat])
# Output: ['University of California, Berkeley', 'California']

# Nested NER: Allow overlapping entities
entities_nested = model.predict_entities(text, labels, flat_ner=False)
print("Nested NER:", [e['text'] for e in entities_nested])
# Output: ['University of California, Berkeley', 'California, Berkeley', 'California']
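One way to picture the difference is to treat flat NER as overlap resolution over the nested candidates. The sketch below applies a greedy, highest-score-first rule to hand-written spans with made-up scores. It is an illustration of the idea, not necessarily the exact decoding GLiNER applies, and greedy_flat is a hypothetical helper name.

```python
text = "The University of California, Berkeley is located in California."

# Hand-written nested candidates; the scores are illustrative only.
nested = [
    {"start": 4, "end": 38, "label": "university", "score": 0.92},
    {"start": 18, "end": 38, "label": "location", "score": 0.55},
    {"start": 53, "end": 63, "label": "location", "score": 0.88},
]


def overlaps(a, b):
    """Two half-open character ranges overlap if each starts before the other ends."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def greedy_flat(spans):
    """Keep the best-scoring spans that do not overlap an already kept span."""
    selected = []
    for span in sorted(spans, key=lambda s: s["score"], reverse=True):
        if not any(overlaps(span, kept) for kept in selected):
            selected.append(span)
    return sorted(selected, key=lambda s: s["start"])


flat = greedy_flat(nested)
flat_texts = [text[s["start"]:s["end"]] for s in flat]
print(flat_texts)
```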

Multi-label Classification

Allow entities to have multiple types:

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Dr. Smith is a cardiologist at Mayo Clinic."
labels = ["person", "doctor", "specialist", "professional", "organization", "hospital"]

# Single label per entity (default)
entities_single = model.predict_entities(text, labels, multi_label=False)
print("Single label:")
for e in entities_single:
    print(f"  {e['text']}: {e['label']}")

# Multiple labels per entity
entities_multi = model.predict_entities(text, labels, multi_label=True)
print("\nMulti-label:")
for e in entities_multi:
    print(f"  {e['text']}: {e['label']}")
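When several labels apply to one span, it can be convenient to merge them into a single record per span. The sketch below assumes multi_label=True yields one dictionary per (span, label) pair, which the per-entity print loop above suggests but does not guarantee. The data is hand-written and labels_per_span is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written stand-in for multi-label output on
# "Dr. Smith is a cardiologist at Mayo Clinic." (scores are made up).
entities_multi = [
    {"start": 0, "end": 9, "text": "Dr. Smith", "label": "person", "score": 0.95},
    {"start": 0, "end": 9, "text": "Dr. Smith", "label": "doctor", "score": 0.88},
    {"start": 31, "end": 42, "text": "Mayo Clinic", "label": "organization", "score": 0.90},
    {"start": 31, "end": 42, "text": "Mayo Clinic", "label": "hospital", "score": 0.93},
]


def labels_per_span(entities):
    """Group labels by character span, best-scoring label first."""
    merged = defaultdict(list)
    for ent in entities:
        merged[(ent["start"], ent["end"], ent["text"])].append((ent["label"], ent["score"]))
    return {
        span: [label for label, _ in sorted(pairs, key=lambda p: p[1], reverse=True)]
        for span, pairs in merged.items()
    }


by_span = labels_per_span(entities_multi)
for (start, end, span_text), span_labels in by_span.items():
    print(f"{span_text} [{start}:{end}] -> {span_labels}")
```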

Local Models and Caching

Loading from Local Directory

from gliner import GLiNER

# Load from local directory
model = GLiNER.from_pretrained("/path/to/local/model")

# Or load from HuggingFace Hub with local cache
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    cache_dir="./model_cache"  # Cache models locally
)

Device Selection

from gliner import GLiNER

# Load on GPU
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    map_location="cuda"  # Use GPU
)

# Load on CPU
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    map_location="cpu"
)

# Check device
print(f"Model is on: {model.device}")

Reduced-precision loading (dtype)

Pass dtype to from_pretrained to load the weights directly at the target floating-point precision — no intermediate fp32 copy, no post-load cast:

from gliner import GLiNER
import torch

# Either a string or a torch.dtype
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", dtype="bf16", map_location="cuda")
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", dtype=torch.bfloat16, map_location="cuda")

Accepted values: "fp16" / "float16" / "half", "bf16" / "bfloat16", "fp32" / "float32" / "float", or any floating-point torch.dtype. Int/bool buffers are left untouched; non-floating dtypes (e.g. torch.int8) are rejected — use quantize="int8" for that path.
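The alias groups listed above amount to a small lookup table. The sketch below is an illustration of that mapping, not the library's own code; normalize_dtype_name is a hypothetical helper, and the case-insensitive matching is a convenience of this sketch that the library may not share.

```python
# Alias groups as listed above, mapped to canonical torch dtype attribute names.
_DTYPE_ALIASES = {
    "fp16": "float16", "float16": "float16", "half": "float16",
    "bf16": "bfloat16", "bfloat16": "bfloat16",
    "fp32": "float32", "float32": "float32", "float": "float32",
}


def normalize_dtype_name(name: str) -> str:
    """Return the canonical dtype name for a supported alias, else raise ValueError."""
    key = name.strip().lower()
    if key not in _DTYPE_ALIASES:
        raise ValueError(f"unsupported floating-point dtype alias: {name!r}")
    return _DTYPE_ALIASES[key]


for alias in ("half", "bf16", "float"):
    print(alias, "->", normalize_dtype_name(alias))

# With PyTorch installed, the canonical name resolves to a dtype object:
# import torch
# dtype = getattr(torch, normalize_dtype_name("bf16"))  # torch.bfloat16
```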

Why use dtype instead of quantize="bf16":

  • quantize casts after the full fp32 state dict + fp32 model are already in memory.

  • dtype casts each tensor as it is read from the safetensors file and pre-casts the model shell before load_state_dict, so a fully-fp32 snapshot never co-exists with the loaded weights. For CPU-only loads, peak host memory during load drops from ~2× fp32 to ~1× fp32 for bf16/fp16. For map_location="cuda", the state dict streams to GPU while the shell is CPU-side, so the saving is avoiding a simultaneous fp32 GPU state dict + fp32 GPU model — not quite a 2×→1× total-footprint reduction, but still a meaningful win on the GPU peak and on the separate post-load cast pass.

When it matters: cold starts and scalable serverless deployments (AWS Lambda, Cloud Run, Modal, RunPod serverless, autoscaled Kubernetes pods, etc.) — startup latency and peak memory directly drive cost and SLA:

  • Shorter cold-start on every new container (one pass instead of load + cast).

  • Lower peak memory lets instances fit on smaller memory tiers and reduces boot-time OOMs under memory pressure.

  • Faster first-inference latency after a scale-from-zero event.

dtype covers plain precision changes (bf16/fp16/fp32). For int8 / torchao / CPU dynamic quantization, keep using quantize (see below). The two can be combined if desired.

Skipping the random-init shell (low_cpu_mem_usage)

dtype= lowers peak memory but doesn’t speed up the load itself: even with dtype="bf16", GLiNER still allocates an fp32 random-initialized model shell, runs Kaiming/Xavier init over every parameter, casts the whole thing to bf16, then overwrites every value with the loaded weights. All of that init work is thrown away.

Pass low_cpu_mem_usage=True to skip it: the model graph is built under torch.device("meta") (shape descriptors only, no allocation, no random init), the state dict is read at the target precision, and load_state_dict(assign=True) swaps the loaded tensors directly into the meta-shell parameter slots in one pass.

model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    dtype="bf16",
    low_cpu_mem_usage=True,
    map_location="cuda",
)

Measured on gliner_medium-v2.1 on an RTX 5090 (n=12 reps, Welch t-tested, OS page cache warmed):

| path | mean load time | speedup | peak host RSS delta |
|---|---|---|---|
| baseline (cuda, bf16) | 3.16 s | 1.0× | 1361 MB |
| low_cpu_mem_usage=True (cuda, bf16) | 1.61 s | 1.96× | 1004 MB |
| baseline (cpu, bf16) | 3.30 s | 1.0× | 1597 MB |
| low_cpu_mem_usage=True (cpu, bf16) | 1.60 s | 2.06× | 1225 MB |
| baseline (cpu, fp32) | 3.04 s | 1.0× | 1598 MB |
| low_cpu_mem_usage=True (cpu, fp32) | 1.45 s | 2.10× | 170 MB |

About 1.5 to 1.7 seconds saved on every cold start in these measurements, plus 23–89% lower peak host RSS depending on dtype and device (the fp32 case is dramatic because safetensors mmaps the on-disk file and we never copy it into anonymous memory). Loaded parameters are bit-identical to the standard path; this was verified across 224 parameters and 1 buffer (position_ids, re-materialized after assign).
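The summary figures can be reproduced from the measurements reported above. This short calculation uses only those published load times and RSS deltas; the dictionary layout and the summarize helper are illustrative.

```python
# (mean load time in seconds, peak host RSS delta in MB), copied from the measurements above.
measurements = {
    "cuda, bf16": {"baseline": (3.16, 1361), "low_cpu_mem_usage": (1.61, 1004)},
    "cpu, bf16": {"baseline": (3.30, 1597), "low_cpu_mem_usage": (1.60, 1225)},
    "cpu, fp32": {"baseline": (3.04, 1598), "low_cpu_mem_usage": (1.45, 170)},
}


def summarize(baseline, fast):
    """Derive speedup, seconds saved, and RSS reduction from two measurements."""
    (t_base, rss_base), (t_fast, rss_fast) = baseline, fast
    return {
        "speedup": round(t_base / t_fast, 2),
        "seconds_saved": round(t_base - t_fast, 2),
        "rss_reduction_pct": round(100 * (rss_base - rss_fast) / rss_base),
    }


summary = {
    config: summarize(runs["baseline"], runs["low_cpu_mem_usage"])
    for config, runs in measurements.items()
}
for config, stats in summary.items():
    print(config, stats)
```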

Default is False while the path matures — enable it explicitly when cold-start latency or peak host memory matters. low_cpu_mem_usage stacks with dtype= (use them together) and is independent of quantize= and compile_torch_model=.

Selective download (variant)

dtype= casts in memory but the on-disk file is still fp32, so the bytes pulled from the Hub don’t shrink. If a publisher uploads a half-precision variant of the file (model.fp16.safetensors or model.bf16.safetensors, following the transformers naming convention), pass variant= to download only that file:

model = GLiNER.from_pretrained("org/gliner_bf16-v1", variant="bf16")
# Halves bytes-on-the-wire vs. the default fp32 download (~745 MB -> ~370 MB
# for gliner_medium-v2.1) when a bf16 file is published.

Behavior — variant= is a best-effort hint, not a hard requirement:

  • variant=None (default): unchanged — pulls the whole repo and loads model.safetensors.

  • variant="fp16" / "bf16" and the variant is published: snapshot_download is filtered with allow_patterns so only model.{variant}.safetensors (plus configs and tokenizer assets) is fetched. dtype= is inferred from variant; passing both with mismatched precisions raises.

  • variant="fp16" / "bf16" and the variant is not published: a UserWarning is emitted and the loader falls back to the default fp32 file plus an in-memory cast — same outcome as passing dtype= alone, no error, no I/O win. The warning text tells the user the publisher hasn’t uploaded the file so the bandwidth savings didn’t apply.

This is the lever to pull for cold-start cost when bytes-on-the-wire dominate. Set variant="bf16" and forget about it — if the publisher has the variant file you get the I/O savings, and if they don’t you get the in-memory dtype= behavior with a one-line warning. The probe uses huggingface_hub.HfApi().list_repo_files (one cheap API call) before downloading.
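The availability probe described above can be approximated as follows. This is a sketch, not the loader's actual code: variant_filename, has_variant, and probe_variant are hypothetical names, only top-level files are considered, and the network call is isolated in one function so the matching logic can be checked offline.

```python
from typing import Iterable


def variant_filename(variant: str) -> str:
    """File name a half-precision variant is expected to use, e.g. model.bf16.safetensors."""
    return f"model.{variant}.safetensors"


def has_variant(repo_files: Iterable[str], variant: str) -> bool:
    """Check a repository file listing for the variant weights file."""
    return variant_filename(variant) in set(repo_files)


def probe_variant(repo_id: str, variant: str) -> bool:
    """List the files of a Hub repository and look for the variant file.

    Requires network access and the huggingface_hub package.
    """
    from huggingface_hub import HfApi

    return has_variant(HfApi().list_repo_files(repo_id), variant)


# Offline check against hand-written file listings.
fp32_only = ["config.json", "model.safetensors", "tokenizer.json"]
with_bf16 = fp32_only + ["model.bf16.safetensors"]
print(has_variant(fp32_only, "bf16"), has_variant(with_bf16, "bf16"))
```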

Quantization, Compilation & FlashDeBERTa

Combine dtype="fp16" (or "bf16") with compile_torch_model=True for up to ~1.9x faster GPU inference with zero quality loss:

from gliner import GLiNER

model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    map_location="cuda",
    dtype="fp16",             # or "bf16" — see "Reduced-precision loading" above
    compile_torch_model=True,
)

Or apply after loading:

import torch
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", map_location="cuda")
model.to(torch.float16)  # fp16 half-precision
model.compile()          # torch.compile with dynamic shapes

Compilation is especially beneficial for short sequences, where the overhead of the standard eager execution is proportionally larger. For longer sequences, FlashDeBERTa is recommended as it scales much better with sequence length.

Benchmarks (CoNLL-2003 strict F1, gliner_medium-v2.1, RTX 5090):

| Condition | F1 | Speedup |
|---|---|---|
| GPU fp32 (baseline) | 0.8107 | 1.00x |
| + dtype="fp16" | 0.8107 | 1.35x |
| + compile | 0.8107 | 1.31x |
| + dtype="fp16" + compile | 0.8107 | 1.94x |

quantize= vs dtype=:

  • dtype="fp16" / "bf16": plain precision change via efficient load (see the dedicated section above). This is the only from_pretrained argument that yields half-precision inference; the post-load equivalent is model.to(...).

  • quantize="int8" — real int8 quantization. On CPU, built-in FBGEMM kernels (~1.6x speedup). On GPU, torchao int8 weight-only quantization (~50% memory reduction, no speed gain). Intended for models fine-tuned with quantization-aware training (QAT); stock DeBERTa-based models lose accuracy with int8.

  • quantize= accepts only "int8" (or None). Passing True, "fp16", or "bf16" raises with a migration message — those were precision downcasts, not quantization, and are handled exclusively by dtype= / model.to(...) now.

Compilation notes:

  • compile_torch_model=True uses torch.compile which JIT-compiles the model via Triton kernels. The first inference call will be slower due to compilation, but all subsequent calls benefit from the compiled graph. This is only available on Linux and WSL (not native Windows or macOS).
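Since the first call pays the compilation cost, timing comparisons are usually more meaningful when warm-up calls are excluded. The harness below is a generic sketch using only the standard library; benchmark and fake_inference are hypothetical names, and the stand-in workload would be replaced by a real model call.

```python
import time


def benchmark(fn, warmup=1, repeats=5):
    """Return the mean wall-clock time of `fn` over `repeats` calls, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as JIT compilation
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


calls = []


def fake_inference():
    """Stand-in workload; replace with e.g. lambda: model.inference(texts, labels)."""
    calls.append(1)
    return sum(i * i for i in range(1000))


mean_seconds = benchmark(fake_inference, warmup=2, repeats=3)
print(f"mean time per call: {mean_seconds:.6f} s over 3 timed calls")

# On GPU, kernels run asynchronously, so torch.cuda.synchronize() should be
# called inside the timed function before the clock is read.
```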

⚡ Accelerating Inference with Sequence Packing

Sequence packing allows GLiNER to combine multiple short requests into a single transformer pass while keeping a block-diagonal attention mask. This drastically reduces the number of padding tokens the encoder needs to process and yields higher throughput.
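To make the block-diagonal mask concrete, the sketch below builds one for two packed sequences using plain Python lists. It is a simplified illustration that ignores separator and label-prompt tokens and is not GLiNER's internal implementation; block_diagonal_mask is a hypothetical helper name.

```python
def block_diagonal_mask(lengths):
    """Build a boolean attention mask where position i may attend to j only within the same sequence."""
    # Segment id of every position in the packed stream, e.g. [0, 0, 1, 1, 1].
    segment_ids = [seg for seg, n in enumerate(lengths) for _ in range(n)]
    total = len(segment_ids)
    return [
        [segment_ids[i] == segment_ids[j] for j in range(total)]
        for i in range(total)
    ]


lengths = [2, 3]  # two short requests packed into one stream of 5 tokens
mask = block_diagonal_mask(lengths)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# Prints:
# 11000
# 11000
# 00111
# 00111
# 00111

# Padding comparison: a padded batch needs len(lengths) * max(lengths) slots,
# while the packed stream needs only sum(lengths).
padded_slots = len(lengths) * max(lengths)
packed_slots = sum(lengths)
print(padded_slots, packed_slots)
```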

  1. Configure packing once for all predictions

    from gliner import GLiNER, InferencePackingConfig
    
    model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", map_location="cuda")
    
    packing_cfg = InferencePackingConfig(
        max_length=512,
        sep_token_id=model.data_processor.transformer_tokenizer.sep_token_id,
        streams_per_batch=1,
    )
    
    # Enable packing for every subsequent `run`/`predict_*` call.
    model.configure_inference_packing(packing_cfg)
    
    texts = ["Email CEO to approve budget", "Schedule yearly medical checkup"]
    labels = ["person", "organization", "action"]
    
    predictions = model.inference(texts, labels, batch_size=16)
    

    You can override or disable the default configuration on a per-call basis by passing packing_config=<new_cfg> or packing_config=None respectively when invoking model.inference or model.predict_entities.

  2. Benchmark the impact

    The bench/bench_gliner_e2e.py script can stress the full GLiNER pipeline in addition to encoder-only Hugging Face models:

    python bench/bench_gliner_e2e.py
    

    To isolate and measure the impact on the encoder:

    python bench/bench_infer_packing.py --batch_size 32 --scenario short_zipf
    

🔌 Usage with spaCy

GLiNER can be seamlessly integrated with spaCy. To begin, install the gliner-spacy library via pip:

pip install gliner-spacy

Following installation, you can add GLiNER to a spaCy NLP pipeline. The example below uses a blank English pipeline, but GLiNER works with any spaCy model.

import spacy
from gliner_spacy.pipeline import GlinerSpacy

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": "urchade/gliner_medium-v2.1",
    "chunk_size": 250,
    "labels": ["person", "organization", "email"],
    "style": "ent",
    "threshold": 0.3,
    "map_location": "cpu" # only available in v.0.0.7
}

# Initialize a blank English spaCy pipeline and add GLiNER
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example text for entity detection
text = "This is a text about Bill Gates and Microsoft."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, "=>", ent.label_)  # ent._.score is also available, but only in v. 0.0.7

Expected Output

Bill Gates => person
Microsoft => organization

🏃‍♀️ Using FlashDeBERTa

Most GLiNER models use the DeBERTa encoder as their backbone. This architecture offers strong token classification performance and typically requires less data to achieve good results. However, a major drawback has been its slower inference speed, and until recently, there was no flash attention implementation compatible with DeBERTa’s disentangled attention mechanism.

To address this, FlashDeBERTa was introduced.

Installation

pip install flashdeberta -U

Before using FlashDeBERTa, please make sure that you have transformers>=4.51.3.

Usage

To enable FlashDeBERTa, set the USE_FLASHDEBERTA environment variable before loading the model:

export USE_FLASHDEBERTA=1

Or set it directly in Python:

import os
os.environ["USE_FLASHDEBERTA"] = "1"

from gliner import GLiNER

# FlashDeBERTa will be used when USE_FLASHDEBERTA is set and the package is installed
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

# To explicitly use eager attention instead
model = GLiNER.from_pretrained(
    "urchade/gliner_medium-v2.1",
    _attn_implementation="eager"
)

Performance Boost: FlashDeBERTa provides up to a 3× speed boost for typical sequence lengths—and even greater improvements for longer sequences.

🛠️ High-Level Pipelines

GLiNER-Multitask models are designed to extract relevant information from plain text based on user-provided custom prompts. These encoder-based multitask models enable efficient and controllable information extraction with a single model, reducing computational and storage costs.

Supported Tasks:

  • Named Entity Recognition (NER): Identify and categorize entities

  • Relation Extraction: Detect relationships between entities

  • Summarization: Extract key sentences

  • Sentiment Extraction: Identify sentiment-bearing text spans

  • Key-Phrase Extraction: Extract important phrases and keywords

  • Question-Answering: Find answers to questions in text

  • Open Information Extraction: Extract information based on open prompts

  • Text Classification: Classify text against predefined labels

Classification

The GLiNERClassifier pipeline performs text classification tasks:

from gliner import GLiNER
from gliner.multitask import GLiNERClassifier

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
classifier = GLiNERClassifier(model=model)

# Single-label classification
text = "SpaceX successfully launched a new rocket into orbit."
labels = ['science', 'technology', 'business', 'sports']

predictions = classifier(text, classes=labels, multi_label=False)
print(predictions)
# Output: [[{'label': 'technology', 'score': 0.84}]]

# Multi-label classification
predictions_multi = classifier(text, classes=labels, multi_label=True)
print(predictions_multi)
# Output: [[{'label': 'technology', 'score': 0.84}, {'label': 'science', 'score': 0.72}]]

Evaluation on Dataset:

# Evaluate on HuggingFace dataset
metrics = classifier.evaluate('dair-ai/emotion')
print(metrics)
# Output: {'micro': 0.4465, 'macro': 0.4243, 'weighted': 0.4884}

Question-Answering

The GLiNERQuestionAnswerer pipeline extracts answers from text:

from gliner import GLiNER
from gliner.multitask import GLiNERQuestionAnswerer

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
answerer = GLiNERQuestionAnswerer(model=model)

# Extract answer
text = "SpaceX was founded by Elon Musk in 2002 to reduce space transportation costs."
question = "Who founded SpaceX?"

predictions = answerer(text, questions=question)
print(predictions)
# Output: [[{'answer': 'Elon Musk', 'score': 0.998}]]

# Multiple questions
questions = ["Who founded SpaceX?", "When was SpaceX founded?", "What is SpaceX's goal?"]
predictions = answerer(text, questions=questions)
for q, pred in zip(questions, predictions):
    print(f"Q: {q}")
    print(f"A: {pred[0]['answer']} (score: {pred[0]['score']:.3f})")

Evaluation on SQuAD:

from gliner.multitask import GLiNERSquadEvaluator

evaluator = GLiNERSquadEvaluator(model_id="knowledgator/gliner-multitask-large-v0.5")
metrics = evaluator.evaluate(threshold=0.25)
print(metrics)
# Output: {'exact': 29.41, 'f1': 29.80, 'total': 11873, ...}

Relation Extraction

The GLiNERRelationExtractor pipeline extracts relationships between entities:

from gliner import GLiNER
from gliner.multitask import GLiNERRelationExtractor

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
relation_extractor = GLiNERRelationExtractor(model=model)

# Extract relations
text = "Elon Musk founded SpaceX in 2002 to reduce space transportation costs."
entities = ['person', 'company', 'year', 'goal']
relations = ['founded', 'founded_in', 'goal']

predictions = relation_extractor(
    text, 
    entities=entities, 
    relations=relations,
    threshold=0.5
)

for pred in predictions[0]:
    print(f"{pred['source']} --[{pred['relation']}]--> {pred['target']}")
    print(f"  Score: {pred['score']:.3f}")

Expected Output

Elon Musk --[founded]--> SpaceX
  Score: 0.958
SpaceX --[founded_in]--> 2002
  Score: 0.912

Open Information Extraction

The GLiNEROpenExtractor pipeline extracts information based on custom prompts:

from gliner import GLiNER
from gliner.multitask import GLiNEROpenExtractor

# Initialize with custom prompt
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract all companies related to space technologies"
)

# Extract information
text = """
Elon Musk founded SpaceX in 2002 to reduce space transportation costs. 
Also Elon is founder of Tesla, NeuroLink and many other companies.
"""

labels = ['company']
predictions = extractor(text, labels=labels, threshold=0.5)

for pred in predictions[0]:
    print(f"{pred['text']} (score: {pred['score']:.3f})")

Expected Output

SpaceX (score: 0.962)
Tesla (score: 0.936)
NeuroLink (score: 0.912)

Custom Prompts for Different Tasks:

# Extract product descriptions
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract product descriptions and features from the text"
)

# Extract technical specifications
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract technical specifications and requirements"
)

# Extract contact information
extractor = GLiNEROpenExtractor(
    model=model,
    prompt="Extract all contact information including emails and phone numbers"
)

Summarization

The GLiNERSummarizer pipeline extracts key sentences for summarization:

from gliner import GLiNER
from gliner.multitask import GLiNERSummarizer

# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
summarizer = GLiNERSummarizer(model=model)

# Extract summary
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop 
and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, 
Gates held the positions of chairman, chief executive officer, president and chief 
software architect, while also being the largest individual shareholder until May 2014.
"""

summary = summarizer(text, threshold=0.1)
print(summary)

Expected Output

['Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop 
and sell BASIC interpreters for the Altair 8800.']

Controlling Summary Length:

# More selective (higher threshold = shorter summary)
summary_short = summarizer(text, threshold=0.5)

# More comprehensive (lower threshold = longer summary)
summary_long = summarizer(text, threshold=0.05)

Advanced Relation Extraction with UTCA

For more nuanced control over relation extraction, use the utca framework:

Installation

pip install utca -U

Setting Up the Pipeline

from utca.core import RenameAttribute
from utca.implementation.predictors import GLiNERPredictor, GLiNERPredictorConfig
from utca.implementation.tasks import (
    GLiNER,
    GLiNERPreprocessor,
    GLiNERRelationExtraction,
    GLiNERRelationExtractionPreprocessor,
)

# Initialize predictor
predictor = GLiNERPredictor(
    GLiNERPredictorConfig(
        model_name="knowledgator/gliner-multitask-large-v0.5",
        device="cuda:0",  # Use "cpu" for CPU inference
    )
)

# Create pipeline
pipe = (
    GLiNER(  # Extract entities
        predictor=predictor,
        preprocess=GLiNERPreprocessor(threshold=0.7)
    )
    | RenameAttribute("output", "entities")  # Prepare for relation extraction
    | GLiNERRelationExtraction(  # Extract relations
        predictor=predictor,
        preprocess=(
            GLiNERPreprocessor(threshold=0.5)
            | GLiNERRelationExtractionPreprocessor()
        )
    )
)

Running the Pipeline

text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop 
and sell BASIC interpreters for the Altair 8800. During his career at Microsoft, 
Gates held the positions of chairman, chief executive officer, president and chief 
software architect, while also being the largest individual shareholder until May 2014.
"""

result = pipe.run({
    "text": text,
    "labels": ["organization", "person", "position", "date"],
    "relations": [
        {
            "relation": "founder",
            "pairs_filter": [("organization", "person")],  # Only consider org-person pairs
            "distance_threshold": 100,  # Max distance between entities (in characters)
        },
        {
            "relation": "inception_date",
            "pairs_filter": [("organization", "date")],
        },
        {
            "relation": "held_position",
            "pairs_filter": [("person", "position")],
        }
    ]
})

# Display results
for relation in result["output"]:
    source = relation['source']['span']
    target = relation['target']['span']
    rel_type = relation['relation']
    score = relation['score']
    print(f"{source} --[{rel_type}]--> {target} (score: {score:.3f})")

Expected Output

Microsoft --[founder]--> Bill Gates (score: 0.968)
Microsoft --[founder]--> Paul Allen (score: 0.863)
Microsoft --[inception_date]--> April 4, 1975 (score: 0.997)
Bill Gates --[held_position]--> chairman (score: 0.966)
Bill Gates --[held_position]--> chief executive officer (score: 0.947)
Bill Gates --[held_position]--> president (score: 0.973)
Bill Gates --[held_position]--> chief software architect (score: 0.950)

Advanced UTCA Features

Distance Filtering:

# Only extract relations where entities are close together
relations = [
    {
        "relation": "works_for",
        "pairs_filter": [("person", "organization")],
        "distance_threshold": 50,  # Entities must be within 50 characters
    }
]
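A character-distance filter of this kind can be pictured as below. How utca measures distance internally is not stated here, so this sketch assumes the gap between the end of the earlier span and the start of the later one. The span dictionaries are hand-written, and char_gap and within_distance are hypothetical names.

```python
def char_gap(a, b):
    """Number of characters between two spans (0 if they touch or overlap)."""
    first, second = sorted((a, b), key=lambda span: span["start"])
    return max(0, second["start"] - first["end"])


def within_distance(a, b, distance_threshold):
    """True if the two spans are at most `distance_threshold` characters apart."""
    return char_gap(a, b) <= distance_threshold


# Hand-written spans with illustrative offsets.
person = {"span": "Jane Doe", "start": 0, "end": 8}
org_near = {"span": "Acme Corp", "start": 18, "end": 27}
org_far = {"span": "Globex", "start": 120, "end": 126}

print(char_gap(person, org_near), within_distance(person, org_near, 50))
print(char_gap(person, org_far), within_distance(person, org_far, 50))
```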

Multiple Relation Types:

# Define complex relation schemas
relations = [
    {
        "relation": "employed_by",
        "pairs_filter": [("person", "organization")],
    },
    {
        "relation": "located_in",
        "pairs_filter": [("organization", "location"), ("person", "location")],
    },
    {
        "relation": "acquired_by",
        "pairs_filter": [("organization", "organization")],
    },
]

Practical Examples

Compliance & PII Redaction

Detect and mask personal data across documents using GLiNER’s multilingual PII model, which covers 40+ entity types (SSN, credit cards, passports, emails, IBANs, etc.) in 100+ languages.

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")

text = """
Patient John Smith (DOB: 03/15/1982, SSN: 123-45-6789) was seen at
Mayo Clinic on January 10, 2024. Contact: john.smith@email.com,
+1-555-867-5309. Insurance ID: BC-9876543. His home address is
742 Evergreen Terrace, Springfield, IL 62704.
"""

pii_labels = [
    "person", "date of birth", "social security number", "email",
    "phone number", "medical facility", "insurance id", "address",
]

entities = model.predict_entities(text, pii_labels, threshold=0.5)

# Redact PII from text
redacted = text
for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
    redacted = redacted[: entity["start"]] + f"[{entity['label'].upper()}]" + redacted[entity["end"] :]

print(redacted)
Expected Output
Patient [PERSON] (DOB: [DATE OF BIRTH], SSN: [SOCIAL SECURITY NUMBER]) was seen at
[MEDICAL FACILITY] on January 10, 2024. Contact: [EMAIL],
[PHONE NUMBER]. Insurance ID: [INSURANCE ID]. His home address is
[ADDRESS].
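
The right-to-left replacement above relies on the predicted spans not overlapping. As a hedged sketch, the helper below drops any span that overlaps one already kept before rebuilding the text. The function name redact and the hand-written sample entities are illustrative assumptions, not part of the library or real model output.

```python
def redact(text, entities):
    """Replace entity spans with [LABEL], skipping spans that overlap a kept one."""
    kept = []
    # Earlier spans first; on equal starts, prefer the longer span.
    ordered = sorted(entities, key=lambda e: (e["start"], -(e["end"] - e["start"])))
    for ent in ordered:
        if kept and ent["start"] < kept[-1]["end"]:
            continue  # overlaps the previously kept span
        kept.append(ent)

    parts, cursor = [], 0
    for ent in kept:
        parts.append(text[cursor:ent["start"]])
        parts.append(f"[{ent['label'].upper()}]")
        cursor = ent["end"]
    parts.append(text[cursor:])
    return "".join(parts)


sample = "Contact John Smith at john.smith@email.com."
sample_entities = [
    {"start": 8, "end": 18, "label": "person"},
    {"start": 13, "end": 18, "label": "person"},  # overlapping duplicate
    {"start": 22, "end": 42, "label": "email"},
]
print(redact(sample, sample_entities))
```

Building the output from left to right with a cursor avoids index drift, so the original offsets remain valid throughout.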

Knowledge Graph Construction

Jointly extract entities and relations in a single pass to build knowledge graphs for Graph RAG, semantic search, and analytics.

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v1.0")

text = """
Elon Musk founded SpaceX in 2002 in Hawthorne, California. The company
developed the Falcon 9 rocket and the Dragon spacecraft. SpaceX was awarded
a $1.6 billion NASA contract for cargo resupply missions to the International
Space Station.
"""

entity_labels = ["person", "organization", "date", "location", "product", "monetary value"]
relation_labels = ["founded", "founded_in", "headquartered_in", "developed", "awarded_by"]

entities, relations = model.inference(
    [text],
    labels=entity_labels,
    relations=relation_labels,
    threshold=0.5,
    relation_threshold=0.5,
)

print("Entities:")
for entity in entities[0]:
    print(f"  {entity['text']} ({entity['label']})")

print("\nRelations (knowledge graph edges):")
for relation in relations[0]:
    head = entities[0][relation["head"]["entity_idx"]]
    tail = entities[0][relation["tail"]["entity_idx"]]
    print(f"  {head['text']} --[{relation['relation']}]--> {tail['text']}")
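
Once each relation is resolved to head and tail texts, the edges can be grouped into a simple adjacency mapping. The sketch below uses only the standard library. build_graph is a hypothetical helper, and the triples are hand-written placeholders shaped like the printed edges, not actual model output.

```python
from collections import defaultdict


def build_graph(triples):
    """Group (head, relation, tail) triples into an adjacency mapping keyed by head."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        edge = (relation, tail)
        if edge not in graph[head]:  # drop exact duplicate edges
            graph[head].append(edge)
    return dict(graph)


# Placeholder triples for illustration only.
triples = [
    ("Elon Musk", "founded", "SpaceX"),
    ("SpaceX", "headquartered_in", "Hawthorne, California"),
    ("SpaceX", "developed", "Falcon 9"),
    ("SpaceX", "developed", "Falcon 9"),
]
graph = build_graph(triples)
for head, edges in graph.items():
    for relation, tail in edges:
        print(f"{head} --[{relation}]--> {tail}")
```

A plain dict of lists is enough for small graphs; the same triples can be loaded into a graph database or graph library if traversal queries are needed.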

Large-Scale Entity Extraction

Use the bi-encoder to tag millions of documents against hundreds of entity types. Pre-compute label embeddings once and reuse them across all documents for maximum throughput.

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-bi-base-v2.0", map_location="cuda")

labels = [
    "person", "organization", "location", "date", "product", "event",
    "technology", "software", "programming_language", "framework",
    "database", "protocol", "standard", "regulation", "currency",
    "measurement", "chemical_compound", "disease", "medication",
    # ... scale to hundreds of types
]

# Pre-compute label embeddings once
label_embeddings = model.encode_labels(labels, batch_size=32)

# Process a large document collection efficiently
documents = [
    "Python 3.12 was released by the PSF in October 2023.",
    "The FDA approved a new treatment for Type 2 diabetes.",
    "Tesla announced record Q4 revenue of $25.2 billion.",
    # ... millions of documents
]

all_entities = model.batch_predict_with_embeds(
    documents,
    label_embeddings,
    labels,
    threshold=0.5,
    batch_size=64,
)

for doc, entities in zip(documents, all_entities):
    print(f"\n{doc[:60]}...")
    for entity in entities:
        print(f"  {entity['text']} => {entity['label']} ({entity['score']:.2f})")
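
For collections too large to hold in memory, documents can be streamed through the model in fixed-size chunks. The following is a sketch under stated assumptions: chunked and tag_stream are hypothetical helpers, and fake_predict_batch is a stand-in for a wrapper around model.batch_predict_with_embeds so the example runs without a model.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def tag_stream(documents, predict_batch, chunk_size=1000):
    """Feed documents to `predict_batch` chunk by chunk, yielding (doc, entities)."""
    for chunk in chunked(documents, chunk_size):
        for doc, entities in zip(chunk, predict_batch(chunk)):
            yield doc, entities


# Stand-in predictor for demonstration only. In practice this would call
# model.batch_predict_with_embeds(chunk, label_embeddings, labels, ...).
def fake_predict_batch(chunk):
    return [[] for _ in chunk]


doc_stream = (f"doc {i}" for i in range(5))
results = list(tag_stream(doc_stream, fake_predict_batch, chunk_size=2))
print(len(results))
```

Because tag_stream is a generator, results can be written to disk as they arrive instead of being collected into one list.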

Domain-Specific NER

Fine-tune GLiNER on your specialized corpus (biomedical, legal, financial, etc.) with minimal labeled data to get high-quality extraction for domain terms.

from gliner import GLiNER

# Start from a pre-trained model
model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")

# Prepare domain-specific training data (NER format)
train_data = [
    {
        "tokenized_text": ["Aspirin", "reduces", "inflammation", "in", "rheumatoid", "arthritis", "patients", "."],
        "ner": [
            [0, 0, "medication"],
            [2, 2, "condition"],
            [4, 5, "disease"],
        ],
    },
    {
        "tokenized_text": ["Metformin", "is", "prescribed", "for", "Type", "2", "diabetes", "."],
        "ner": [
            [0, 0, "medication"],
            [4, 6, "disease"],
        ],
    },
    # ... add more examples from your domain
]

# Fine-tune — even 50–200 examples can yield strong results
model.train_model(
    train_dataset=train_data,
    output_dir="models/bio-gliner",
    max_steps=500,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    bf16=True,
)

# Use the fine-tuned model
text = "The patient was started on Lisinopril 10mg for hypertension."
entities = model.predict_entities(text, ["medication", "dosage", "disease"], threshold=0.5)
for entity in entities:
    print(f"  {entity['text']} => {entity['label']}")

For detailed training guides, see the training documentation.
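
Annotations often arrive as character offsets rather than token indices. The sketch below converts character spans into the tokenized_text / ner layout shown above, with inclusive token indices as in the examples. The regex tokenizer and the helper name to_gliner_example are assumptions for illustration; the library's own word splitter may tokenize differently, so verify the tokens before training.

```python
import re


def to_gliner_example(text, char_spans):
    """Convert (start, end, label) character spans to token-index annotations."""
    tokens, offsets = [], []
    # Simple tokenizer: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))

    ner = []
    for start, end, label in char_spans:
        # Indices of tokens overlapping the character range [start, end).
        covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
        if covered:
            ner.append([covered[0], covered[-1], label])  # inclusive end index
    return {"tokenized_text": tokens, "ner": ner}


text = "Metformin is prescribed for Type 2 diabetes."
example = to_gliner_example(text, [(0, 9, "medication"), (28, 43, "disease")])
print(example)
```

Spans that cover no token are silently skipped here; a stricter version could raise an error to surface misaligned annotations.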

Multi-lingual Information Extraction

Extract structured data from 100+ languages with a single model — no per-language setup required.

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

texts = {
    "English": "Barack Obama was born in Honolulu, Hawaii on August 4, 1961.",
    "French": "Emmanuel Macron est le président de la République française depuis 2017.",
    "Japanese": "東京都は日本の首都であり、2021年にオリンピックが開催されました。",
    "Arabic": "تأسست شركة أرامكو السعودية في عام 1933 في المملكة العربية السعودية.",
}

labels = ["person", "location", "organization", "date"]

for lang, text in texts.items():
    entities = model.predict_entities(text, labels, threshold=0.5)
    print(f"\n{lang}: {text[:60]}...")
    for entity in entities:
        print(f"  {entity['text']} => {entity['label']}")

Search & Retrieval Augmentation

Parse user queries into structured entities to improve search relevance and RAG pipelines — route queries, filter results, or enrich retrieval context.

from gliner import GLiNER

model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")

queries = [
    "What were Apple's revenue numbers in Q3 2023?",
    "Find clinical trials for Alzheimer's treatment in Europe",
    "Show me Python machine learning libraries released after 2022",
]

query_labels = [
    "company", "metric", "time_period", "disease", "treatment",
    "location", "programming_language", "topic", "product_type",
]

for query in queries:
    entities = model.predict_entities(query, query_labels, threshold=0.4)
    print(f"\nQuery: {query}")

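    # Build structured filters from extracted entities.
    # Group texts by label so that repeated labels are not overwritten.
    filters = {}
    for entity in entities:
        filters.setdefault(entity["label"], []).append(entity["text"])
    print(f"  Extracted filters: {filters}")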

    # Use filters to enhance retrieval
    # e.g., add metadata filters to your vector DB query,
    # or expand the search with entity synonyms

⚡ Prompt Compression (Precomputed Prompt Embeddings)

For uni-encoder models (span, token, and relation-extraction variants) you can precompute the prompt embeddings for a fixed label set and reuse them at inference time. In precomputed mode the encoder receives only the text (no <<ENT>>label1<<ENT>>...<<SEP>> prefix), which shortens the input sequence, reduces attention cost, and can noticeably speed up inference — at a small accuracy trade-off versus re-encoding the prompts on every call.

How it works

BaseGLiNER.compress_prompt_embeddings(texts, labels, rel_labels=None, batch_size=8, distill=False, distill_threshold=0.3, distill_epochs=3, distill_lr=1e-5, distill_batch_size=None, distill_output_dir="./distill_ckpt", distill_train_kwargs=None):

  1. Runs the normal forward pass over (texts, labels) pairs.

  2. Extracts the per-label prompt embedding (the <<ENT>> token representation, pre-projection) from each example.

  3. Averages across all examples to produce an (L, D) matrix stored as a non-trainable parameter on the underlying model (model.precomputed_prompts).

  4. Sets config.precomputed_prompts_mode = True and writes config.id_to_classes, so subsequent predict_entities / forward calls skip prompt-prepending and look up the stored embeddings instead.

The stored embeddings travel with state_dict, so save_pretrained / from_pretrained round-trip them automatically. Training can continue after compression — the stored matrix is frozen but everything else keeps training.
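
The averaging in step 3 can be illustrated with plain Python lists. This is a toy sketch of the arithmetic only, assuming each calibration text yields one (L, D) matrix of prompt embeddings. average_prompt_embeddings is a hypothetical name and the numbers are made up; the library operates on tensors internally.

```python
def average_prompt_embeddings(per_example):
    """Average N matrices of shape (L, D) into a single (L, D) matrix."""
    n = len(per_example)
    num_labels = len(per_example[0])
    dim = len(per_example[0][0])
    return [
        [sum(example[i][j] for example in per_example) / n for j in range(dim)]
        for i in range(num_labels)
    ]


# Toy values: 2 calibration texts, L = 2 labels, D = 3 dimensions.
per_example = [
    [[1.0, 0.0, 2.0], [0.0, 4.0, 0.0]],
    [[3.0, 2.0, 0.0], [2.0, 0.0, 2.0]],
]
precomputed = average_prompt_embeddings(per_example)
print(precomputed)
```

Each row of the result is one label's averaged embedding, which is why more diverse calibration texts give a more representative matrix.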

Basic usage (entity extraction)

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Representative texts from your target domain. They do not need labels;
# they are only used as contexts while averaging the prompt representations.
calibration_texts = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Apple announced a new iPhone at their Cupertino headquarters.",
    # ... ideally 100–1000 diverse sentences from your domain
]

labels = ["person", "organization", "location", "date"]

# One-time compression step
model.compress_prompt_embeddings(calibration_texts, labels, batch_size=16)

# Inference now uses the precomputed prompts: the labels are no longer
# prepended to the input, but the list passed here must match the compressed set
entities = model.predict_entities(
    "Tim Cook visited Berlin last Tuesday.",
    labels,               # must match (order-insensitive) the compressed set
    threshold=0.5,
)

# Persist the compressed model
model.save_pretrained("./gliner-compressed")

Relation extraction

For relex models (UniEncoderSpanRelexModel / UniEncoderTokenRelexModel), pass rel_labels so the <<REL>> prompt embeddings are compressed as well:

model.compress_prompt_embeddings(
    texts=calibration_texts,
    labels=["person", "organization", "location"],
    rel_labels=["works_for", "located_in", "founder_of"],
    batch_size=8,
)

End-to-end distillation

Compression alone can dip quality because averaged prompt embeddings drop context-specific signal. Pass distill=True to recover it in a single call: the raw (pre-compression) model first generates pseudo-labels over texts, prompts are then compressed, and the compressed model is fine-tuned on those pseudo-labels — no separate script required.

model.compress_prompt_embeddings(
    texts=calibration_texts,     # also used as the distillation corpus
    labels=labels,
    batch_size=16,
    distill=True,
    distill_threshold=0.3,       # pseudo-label confidence cutoff
    distill_epochs=3,
    distill_lr=1e-5,
    distill_output_dir="./distill_ckpt",
)

Relevant knobs:

  • distill_threshold: confidence cutoff used when the raw model produces pseudo-labels. Lower values widen the training signal but add noise.

  • distill_epochs, distill_lr: fine-tuning schedule.

  • distill_batch_size: defaults to batch_size if omitted.

  • distill_output_dir: forwarded to train_model.

  • distill_train_kwargs: dict of extra kwargs merged into the underlying train_model call (e.g. to override save_strategy, logging_steps, etc.).

Pseudo-labels are generated from the same texts used for compression, so one diverse in-domain corpus serves both roles.
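
The effect of distill_threshold on the pseudo-label set can be pictured with a small filter. This is an illustrative sketch, not the library's internal code: select_pseudo_labels is a hypothetical helper, the scores are made up, and whether the real cutoff is inclusive is an assumption.

```python
def select_pseudo_labels(predictions, distill_threshold=0.3):
    """Keep only predicted entities whose score reaches the cutoff."""
    return [
        [ent for ent in doc_entities if ent["score"] >= distill_threshold]
        for doc_entities in predictions
    ]


# Made-up predictions for one document, for illustration only.
predictions = [
    [
        {"text": "Tim Cook", "label": "person", "score": 0.92},
        {"text": "Tuesday", "label": "date", "score": 0.35},
        {"text": "visited", "label": "location", "score": 0.12},
    ]
]
for cutoff in (0.3, 0.5):
    kept = select_pseudo_labels(predictions, cutoff)[0]
    print(cutoff, [ent["text"] for ent in kept])
```

Raising the cutoff from 0.3 to 0.5 removes the borderline "Tuesday" span in this toy case, which shows the trade-off between a wider training signal and noisier labels.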

Tips and Best Practices

  1. Choose the right model architecture:

    • UniEncoder: General purpose, < 30 entity types

    • BiEncoder: Many entity types (50-200+)

    • Token-level: Long entity spans

    • Relation extraction: Knowledge graph construction

  2. Optimize threshold for your use case:

    • High precision: threshold = 0.6-0.8

    • Balanced: threshold = 0.4-0.6

    • High recall: threshold = 0.2-0.4

  3. Use batch processing for multiple documents:

    • More efficient GPU utilization

    • Faster overall processing

  4. Pre-compute label embeddings (BiEncoder):

    • Cache embeddings when processing many documents

    • Significant speedup for production use

  5. Enable FlashDeBERTa:

    • ~3x speed improvement

    • No accuracy loss

  6. Use appropriate labels:

    • Specific labels work better than generic ones

    • “company” > “entity”

    • “medication” > “word”
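
One way to pick a threshold (tip 2) is to predict once with a low threshold on a small labeled sample, then sweep the cutoff offline. The sketch below computes exact-match precision and recall from entity dictionaries. precision_recall is a hypothetical helper, and the predictions and gold spans are made-up values for illustration.

```python
def precision_recall(predicted, gold, threshold):
    """Exact-match precision and recall for predictions with score >= threshold."""
    kept = {
        (e["start"], e["end"], e["label"])
        for e in predicted
        if e["score"] >= threshold
    }
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Made-up gold annotations and predictions, for illustration only.
gold = {(0, 8, "person"), (17, 23, "location")}
predicted = [
    {"start": 0, "end": 8, "label": "person", "score": 0.90},
    {"start": 17, "end": 23, "label": "location", "score": 0.45},
    {"start": 9, "end": 16, "label": "location", "score": 0.25},
]
for threshold in (0.2, 0.4, 0.6):
    p, r = precision_recall(predicted, gold, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Sweeping on stored predictions avoids re-running the model for every candidate threshold.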

Troubleshooting

Low Accuracy

# Try lowering the threshold
entities = model.predict_entities(text, labels, threshold=0.3)

# Use more specific labels
labels = ["tech_company", "software_product", "founder"]  # Specific
# instead of overly generic ones such as:
# labels = ["organization", "thing", "person"]

# Try a larger model
model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")

Slow Inference

# Enable FlashDeBERTa
# pip install flashdeberta

# Compile model
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    compile_torch_model=True
)

# Use batch processing
entities_batch = model.inference(texts, labels, batch_size=16)

# For BiEncoder: pre-compute embeddings
label_embeds = model.encode_labels(labels)
entities = model.predict_with_embeds(text, label_embeds, labels)

Out of Memory

# Reduce batch size
entities = model.inference(texts, labels, batch_size=4)

# Use a smaller model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")

# Process on CPU
model = GLiNER.from_pretrained(
    "urchade/gliner_small-v2.1",
    map_location="cpu"
)