Advanced Usage
Basic Use Case
After installing the GLiNER library, import the GLiNER class. You can load your chosen model with GLiNER.from_pretrained and use inference to identify entities within your text.
from gliner import GLiNER
# Load a GLiNER model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
# Sample text for entity prediction
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu];
born 5 February 1985) is a Portuguese professional footballer who plays as a forward for
and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely
regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or
awards, a record three UEFA Men's Player of the Year Awards, and four European Golden
Shoes, the most by a European player.
"""
# Define labels for entity extraction
labels = ["person", "award", "date", "teams", "competition"]
# Perform entity prediction
entities = model.predict_entities(text, labels, threshold=0.5)
# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
Expected Output
Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award
Understanding the Output
Each predicted entity is a dictionary with the following structure:
{
    'start': int,   # Start character position in text
    'end': int,     # End character position in text
    'text': str,    # Extracted text span
    'label': str,   # Predicted entity type
    'score': float  # Confidence score (0-1)
}
Example:
for entity in entities:
    print(f"Text: {entity['text']}")
    print(f"Label: {entity['label']}")
    print(f"Score: {entity['score']:.3f}")
    print(f"Position: [{entity['start']}:{entity['end']}]")
    print("---")
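The start/end offsets index directly into the original input string, so each span can be recovered or highlighted without re-tokenizing. A small sketch over the documented dict format (the sample entities below are made up for illustration, not real model output):

```python
def highlight(text, entities, marker=("[", "]")):
    """Wrap each predicted span in markers, working right-to-left
    so earlier character offsets stay valid."""
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        # Each span must round-trip: text[start:end] == ent["text"]
        assert out[ent["start"]:ent["end"]] == ent["text"]
        out = out[:ent["start"]] + marker[0] + ent["text"] + marker[1] + out[ent["end"]:]
    return out

text = "Steve Jobs founded Apple."
entities = [
    {"start": 0, "end": 10, "text": "Steve Jobs", "label": "person", "score": 0.97},
    {"start": 19, "end": 24, "text": "Apple", "label": "company", "score": 0.95},
]
print(highlight(text, entities))  # [Steve Jobs] founded [Apple].
```

Processing spans right-to-left means the inserted markers never shift the offsets of spans that have not been wrapped yet.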
Batch Processing
For processing multiple texts efficiently, use the inference method:
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
# Multiple texts to process
texts = [
"Apple Inc. was founded by Steve Jobs in Cupertino, California.",
"Google LLC is headquartered in Mountain View.",
"Amazon was started by Jeff Bezos in Seattle."
]
labels = ["organization", "person", "location"]
# Process all texts at once
all_entities = model.inference(texts, labels, batch_size=3, threshold=0.5)
# Display results for each text
for i, entities in enumerate(all_entities):
    print(f"\nText {i+1}: {texts[i]}")
    print("Entities:")
    for entity in entities:
        print(f"  - {entity['text']} ({entity['label']}): {entity['score']:.2f}")
Benefits of Batch Processing:

- Faster: process multiple texts in parallel
- Efficient: better GPU utilization
- Scalable: handle large document collections
Using Different Model Architectures
GLiNER supports multiple architecture variants, each optimized for different scenarios.
UniEncoder Models (Standard)
Best for general-purpose NER with up to ~30 entity types:
from gliner import GLiNER
# Load a standard UniEncoder model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Apple Inc. was founded by Steve Jobs in 1976."
labels = ["company", "person", "date"]
entities = model.predict_entities(text, labels)
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")
BiEncoder Models (Scalable)
Best for handling many entity types (50-200+) with pre-computed label embeddings:
from gliner import GLiNER
# Load a BiEncoder model
model = GLiNER.from_pretrained("knowledgator/gliner-bi-small-v1.0")
# BiEncoders handle many entity types efficiently
labels = [
"person", "organization", "location", "date", "product", "event",
"technology", "software", "hardware", "programming_language",
"framework", "library", "database", "protocol", "standard",
# ... can handle 100+ types efficiently
]
text = "Python is a programming language created by Guido van Rossum."
entities = model.predict_entities(text, labels)
# For production: pre-compute label embeddings
label_embeddings = model.encode_labels(labels, batch_size=16)
# Then use cached embeddings for faster inference
entities = model.predict_with_embeds(
text,
label_embeddings,
labels,
threshold=0.5
)
BiEncoder Advantages:

- Handle 100+ entity types without performance degradation
- Pre-compute label embeddings once, reuse across documents
- Faster inference when processing many documents with the same entity types
Token-Level Models
Best for extracting long entity spans (multi-sentence entities, summaries):
from gliner import GLiNER
# Load a token-level model
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
# Token-level models excel at long entities
text = """
The European Union is a political and economic union of 27 member states
that are located primarily in Europe. The EU has developed an internal
single market through a standardised system of laws.
"""
labels = ["organization", "number", "location", "concept"]
entities = model.predict_entities(text, labels)
for entity in entities:
    print(f"{entity['text'][:50]}... => {entity['label']}")
Relation Extraction Models
Extract both entities and relationships between them:
from gliner import GLiNER
# Load a relation extraction model
model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v0.5")
text = "Bill Gates founded Microsoft in 1975. The company is headquartered in Redmond."
# Define entity types and relation types
entity_labels = ["person", "organization", "date", "location"]
relation_labels = ["founded", "founded_in", "headquartered_in"]
# Extract entities and relations
entities, relations = model.inference(
[text],
labels=entity_labels,
relations=relation_labels,
threshold=0.5,
relation_threshold=0.5
)
# Display entities
print("Entities:")
for entity in entities[0]:
    print(f"  {entity['text']} ({entity['label']})")

# Display relations
print("\nRelations:")
for relation in relations[0]:
    head = entities[0][relation['head']['entity_idx']]
    tail = entities[0][relation['tail']['entity_idx']]
    print(f"  {head['text']} --[{relation['relation']}]--> {tail['text']}")
Expected Output
Entities:
Bill Gates (person)
Microsoft (organization)
1975 (date)
Redmond (location)
Relations:
Bill Gates --[founded]--> Microsoft
Microsoft --[founded_in]--> 1975
Microsoft --[headquartered_in]--> Redmond
Advanced Configuration
Adjusting the Threshold
Control the precision-recall tradeoff:
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Apple Inc. is a technology company."
labels = ["company", "industry"]
# High threshold: Higher precision, lower recall
entities_high = model.predict_entities(text, labels, threshold=0.7)
print(f"High threshold (0.7): {len(entities_high)} entities")
# Low threshold: Lower precision, higher recall
entities_low = model.predict_entities(text, labels, threshold=0.3)
print(f"Low threshold (0.3): {len(entities_low)} entities")
# Default threshold
entities_default = model.predict_entities(text, labels) # threshold=0.5
print(f"Default threshold (0.5): {len(entities_default)} entities")
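Because each predicted entity carries its score, you can also run the model once at a permissive threshold and compare stricter thresholds by filtering locally, without re-running inference. A sketch over the documented output format, not a GLiNER API (the sample scores are illustrative):

```python
def filter_by_threshold(entities, threshold):
    """Keep only entities whose confidence meets the threshold."""
    return [e for e in entities if e["score"] >= threshold]

# In practice these would come from one permissive call, e.g.
# entities = model.predict_entities(text, labels, threshold=0.3)
entities = [
    {"text": "Apple Inc.", "label": "company", "score": 0.91},
    {"text": "technology", "label": "industry", "score": 0.42},
]
for t in (0.3, 0.5, 0.7):
    print(f"threshold={t}: {len(filter_by_threshold(entities, t))} entities")
```

This is handy when sweeping thresholds on a validation set: one forward pass, many candidate cutoffs.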
Relation extraction models also accept two additional threshold parameters:

- adjacency_threshold: confidence threshold for adjacency-matrix reconstruction (defaults to threshold).
- relation_threshold: confidence threshold for relations (defaults to threshold).
from gliner import GLiNER
# Use a relation extraction model (see "Relation Extraction Models" above)
model = GLiNER.from_pretrained("knowledgator/gliner-relex-large-v0.5")
text = "Apple Inc. is a technology company founded in 1976."
labels = ["company", "industry", "date"]
relations = ["founded in"]
results = model.predict_entities(text, labels, relations=relations, threshold=0.3, adjacency_threshold=0.25, relation_threshold=0.7)
Use a lower adjacency threshold so the model can rerank and classify more pairs of entities that may be linked. Set a higher relation threshold for more specificity and better precision. Adapt all three thresholds to your use case.

Flat vs Nested NER
Control whether entities can overlap:
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "The University of California, Berkeley is located in California."
labels = ["university", "location"]
# Flat NER: No overlapping entities (default)
entities_flat = model.predict_entities(text, labels, flat_ner=True)
print("Flat NER:", [e['text'] for e in entities_flat])
# Output: ['University of California, Berkeley', 'California']
# Nested NER: Allow overlapping entities
entities_nested = model.predict_entities(text, labels, flat_ner=False)
print("Nested NER:", [e['text'] for e in entities_nested])
# Output: ['University of California, Berkeley', 'California, Berkeley', 'California']
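The flat behavior can also be approximated in post-processing: among overlapping spans, keep the highest-scoring one. A minimal greedy sketch (the greedy-by-score rule is an assumption for illustration; GLiNER's internal decoding may resolve overlaps differently):

```python
def flatten(entities):
    """Greedily keep the highest-scoring entity among overlapping spans."""
    kept = []
    for ent in sorted(entities, key=lambda e: e["score"], reverse=True):
        overlaps = any(
            ent["start"] < k["end"] and k["start"] < ent["end"] for k in kept
        )
        if not overlaps:
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])

# Offsets refer to the sentence used above; scores are illustrative.
nested = [
    {"text": "University of California, Berkeley", "start": 4, "end": 38,
     "score": 0.95, "label": "university"},
    {"text": "California, Berkeley", "start": 18, "end": 38,
     "score": 0.60, "label": "location"},
    {"text": "California", "start": 53, "end": 63,
     "score": 0.88, "label": "location"},
]
print([e["text"] for e in flatten(nested)])
# ['University of California, Berkeley', 'California']
```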
Multi-label Classification
Allow entities to have multiple types:
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = "Dr. Smith is a cardiologist at Mayo Clinic."
labels = ["person", "doctor", "specialist", "professional", "organization", "hospital"]
# Single label per entity (default)
entities_single = model.predict_entities(text, labels, multi_label=False)
print("Single label:")
for e in entities_single:
    print(f"  {e['text']}: {e['label']}")

# Multiple labels per entity
entities_multi = model.predict_entities(text, labels, multi_label=True)
print("\nMulti-label:")
for e in entities_multi:
    print(f"  {e['text']}: {e['label']}")
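With multi_label=True, the same span appears once per predicted type, so grouping by character offsets collects all labels per span. A sketch over the documented dict format (the sample entities are illustrative):

```python
from collections import defaultdict

def group_labels(entities):
    """Collect all predicted labels for each (start, end, text) span."""
    spans = defaultdict(list)
    for e in entities:
        spans[(e["start"], e["end"], e["text"])].append(e["label"])
    return spans

multi = [
    {"start": 0, "end": 9, "text": "Dr. Smith", "label": "person", "score": 0.93},
    {"start": 0, "end": 9, "text": "Dr. Smith", "label": "doctor", "score": 0.88},
    {"start": 31, "end": 42, "text": "Mayo Clinic", "label": "organization", "score": 0.90},
]
for (start, end, span_text), labels in group_labels(multi).items():
    print(f"{span_text}: {labels}")
```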
Local Models and Caching
Loading from Local Directory
from gliner import GLiNER
# Load from local directory
model = GLiNER.from_pretrained("/path/to/local/model")
# Or load from HuggingFace Hub with local cache
model = GLiNER.from_pretrained(
"urchade/gliner_small-v2.1",
cache_dir="./model_cache" # Cache models locally
)
Device Selection
from gliner import GLiNER
# Load on GPU
model = GLiNER.from_pretrained(
"urchade/gliner_small-v2.1",
map_location="cuda" # Use GPU
)
# Load on CPU
model = GLiNER.from_pretrained(
"urchade/gliner_small-v2.1",
map_location="cpu"
)
# Check device
print(f"Model is on: {model.device}")
Reduced-precision loading (dtype)
Pass dtype to from_pretrained to load the weights directly at the target floating-point precision: no intermediate fp32 copy, no post-load cast:
from gliner import GLiNER
import torch
# Either a string or a torch.dtype
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", dtype="bf16", map_location="cuda")
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", dtype=torch.bfloat16, map_location="cuda")
Accepted values: "fp16" / "float16" / "half", "bf16" / "bfloat16", "fp32" / "float32" / "float", or any floating-point torch.dtype. Int/bool buffers are left untouched; non-floating dtypes (e.g. torch.int8) are rejected; use quantize="int8" for that path.
Why use dtype instead of quantize="bf16":
quantize= casts after the full fp32 state dict and fp32 model are already in memory. dtype= casts each tensor as it is read from the safetensors file and pre-casts the model shell before load_state_dict, so a fully-fp32 snapshot never co-exists with the loaded weights. For CPU-only loads, peak host memory during load drops from ~2× fp32 to ~1× fp32 for bf16/fp16. For map_location="cuda", the state dict streams to GPU while the shell is CPU-side, so the saving is avoiding a simultaneous fp32 GPU state dict and fp32 GPU model: not quite a 2×-to-1× total-footprint reduction, but still a meaningful win on the GPU peak and on the separate post-load cast pass.
When it matters: cold starts and scalable serverless deployments (AWS Lambda, Cloud Run, Modal, RunPod serverless, autoscaled Kubernetes pods, etc.), where startup latency and peak memory directly drive cost and SLA:

- Shorter cold start on every new container (one pass instead of load + cast).
- Lower peak memory lets instances fit on smaller memory tiers and reduces boot-time OOMs under memory pressure.
- Faster first-inference latency after a scale-from-zero event.
dtype covers plain precision changes (bf16/fp16/fp32). For int8 / torchao / CPU dynamic quantization, keep using quantize (see below). The two can be combined if desired.
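The memory arithmetic behind these savings is linear in dtype width. A back-of-the-envelope sketch using the ~745 MB fp32 checkpoint size quoted for gliner_medium-v2.1 in the variant section below (the parameter count derived from it is an estimate, not an official figure):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

fp32_mb = 745.0  # approximate gliner_medium-v2.1 fp32 checkpoint size
n_params = fp32_mb * 1024**2 / BYTES_PER_PARAM["fp32"]  # roughly 195M parameters

weights_mb = {d: n_params * b / 1024**2 for d, b in BYTES_PER_PARAM.items()}
print(f"bf16 weights: ~{weights_mb['bf16']:.0f} MB")  # half the fp32 footprint

# Cast-after-load (old quantize= path): fp32 state dict + fp32 model coexist.
peak_cast_after_load = 2 * weights_mb["fp32"]
# dtype= streaming load: no full fp32 snapshot alongside the half-precision copy.
peak_streaming = 1 * weights_mb["fp32"]
print(f"peak host memory (CPU load): ~{peak_cast_after_load:.0f} MB vs ~{peak_streaming:.0f} MB")
```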
Skipping the random-init shell (low_cpu_mem_usage)
dtype= lowers peak memory but doesn't speed up the load itself. Even with dtype="bf16", GLiNER still allocates an fp32 random-initialized model shell, runs Kaiming/Xavier init over every parameter, casts the whole thing to bf16, then overwrites every value with the loaded weights. All of that init work is thrown away.
Pass low_cpu_mem_usage=True to skip it: the model graph is built under torch.device("meta") (shape descriptors only, no allocation, no random init), the state dict is read at the target precision, and load_state_dict(assign=True) swaps the loaded tensors directly into the meta-shell parameter slots in one pass.
model = GLiNER.from_pretrained(
"urchade/gliner_medium-v2.1",
dtype="bf16",
low_cpu_mem_usage=True,
map_location="cuda",
)
Measured on gliner_medium-v2.1 on an RTX 5090 (n=12 reps, Welch t-tested, OS page cache warmed):

| path | mean load time | speedup | peak host RSS delta |
|---|---|---|---|
| baseline (cuda, bf16) | 3.16 s | 1.0× | 1361 MB |
| + low_cpu_mem_usage | 1.61 s | 1.96× | 1004 MB |
| baseline (cpu, bf16) | 3.30 s | 1.0× | 1597 MB |
| + low_cpu_mem_usage | 1.60 s | 2.06× | 1225 MB |
| baseline (cpu, fp32) | 3.04 s | 1.0× | 1598 MB |
| + low_cpu_mem_usage | 1.45 s | 2.10× | 170 MB |
About 1.5 seconds saved on every cold start, plus 23-89% lower peak host RSS depending on dtype (the fp32 case is dramatic because safetensors mmaps the on-disk file and we never copy it into anonymous memory). Loaded parameters are bit-identical to the standard path; verified across 224 parameters and 1 buffer (position_ids, re-materialized after assign).
Default is False while the path matures; enable it explicitly when cold-start latency or peak host memory matters. low_cpu_mem_usage stacks with dtype= (use them together) and is independent of quantize= and compile_torch_model=.
Selective download (variant)
dtype= casts in memory, but the on-disk file is still fp32, so the bytes pulled from the Hub don't shrink. If a publisher uploads a half-precision variant of the file (model.fp16.safetensors or model.bf16.safetensors, following the transformers naming convention), pass variant= to download only that file:
model = GLiNER.from_pretrained("org/gliner_bf16-v1", variant="bf16")
# Halves bytes-on-the-wire vs. the default fp32 download (~745 MB -> ~370 MB
# for gliner_medium-v2.1) when a bf16 file is published.
Behavior: variant= is a best-effort hint, not a hard requirement.

- variant=None (default): unchanged; pulls the whole repo and loads model.safetensors.
- variant="fp16" / "bf16" and the variant is published: snapshot_download is filtered with allow_patterns so only model.{variant}.safetensors (plus configs and tokenizer assets) is fetched. dtype= is inferred from variant; passing both with mismatched precisions raises.
- variant="fp16" / "bf16" and the variant is not published: a UserWarning is emitted and the loader falls back to the default fp32 file plus an in-memory cast. This is the same outcome as passing dtype= alone: no error, no I/O win. The warning text tells the user the publisher hasn't uploaded the file, so the bandwidth savings didn't apply.
This is the lever to pull for cold-start cost when bytes-on-the-wire dominate. Set variant="bf16" and forget about it: if the publisher has the variant file you get the I/O savings, and if they don't you get the in-memory dtype= behavior with a one-line warning. The probe uses huggingface_hub.HfApi().list_repo_files (one cheap API call) before downloading.
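The selection logic amounts to a filename check against the repo's file list. A sketch of the documented fallback behavior (the function name and structure are illustrative, not the library's actual internals):

```python
import warnings

def pick_weights_file(repo_files, variant=None):
    """Mirror the documented variant= fallback: prefer the variant file,
    warn and fall back to the fp32 default when it is not published."""
    if variant is not None:
        wanted = f"model.{variant}.safetensors"
        if wanted in repo_files:
            return wanted
        warnings.warn(
            f"{wanted} not published in this repo; falling back to "
            "model.safetensors with an in-memory dtype cast."
        )
    return "model.safetensors"

files = ["config.json", "model.safetensors", "model.bf16.safetensors"]
print(pick_weights_file(files, variant="bf16"))  # model.bf16.safetensors
print(pick_weights_file(files, variant="fp16"))  # warns, then model.safetensors
```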
Quantization, Compilation & FlashDeBERTa
Combine dtype="fp16" (or "bf16") with compile_torch_model=True for up to ~1.9x faster GPU inference with zero quality loss:
from gliner import GLiNER
model = GLiNER.from_pretrained(
"urchade/gliner_medium-v2.1",
map_location="cuda",
dtype="fp16", # or "bf16"; see "Reduced-precision loading" above
compile_torch_model=True,
)
Or apply after loading:
import torch
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", map_location="cuda")
model.to(torch.float16) # fp16 half-precision
model.compile() # torch.compile with dynamic shapes
Compilation is especially beneficial for short sequences, where the overhead of the standard eager execution is proportionally larger. For longer sequences, FlashDeBERTa is recommended as it scales much better with sequence length.
Benchmarks (CoNLL-2003 strict F1, gliner_medium-v2.1, RTX 5090):

| Condition | F1 | Speedup |
|---|---|---|
| GPU fp32 (baseline) | 0.8107 | 1.00x |
| + fp16 | 0.8107 | 1.35x |
| + compile | 0.8107 | 1.31x |
| + fp16 + compile | 0.8107 | 1.94x |
quantize= vs dtype=:

- dtype="fp16" / "bf16": a plain precision change via the efficient load path (see the dedicated section above). This is the only way to get half-precision inference.
- quantize="int8": real int8 quantization. On CPU, built-in FBGEMM kernels (~1.6x speedup). On GPU, torchao int8 weight-only quantization (~50% memory reduction, no speed gain). Intended for models fine-tuned with quantization-aware training (QAT); stock DeBERTa-based models lose accuracy with int8.
- quantize= accepts only "int8" (or None). Passing True, "fp16", or "bf16" raises with a migration message; those were precision downcasts, not quantization, and are handled exclusively by dtype= / model.to(...) now.
Compilation notes:

compile_torch_model=True uses torch.compile, which JIT-compiles the model via Triton kernels. The first inference call is slower due to compilation, but all subsequent calls benefit from the compiled graph. This is only available on Linux and WSL (not native Windows or macOS).
Accelerating Inference with Sequence Packing
Sequence packing allows GLiNER to combine multiple short requests into a single transformer pass while keeping a block-diagonal attention mask. This drastically reduces the number of padding tokens the encoder needs to process and yields higher throughput.
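The block-diagonal mask is what keeps packed requests independent: token i may attend to token j only when both belong to the same original text. A pure-Python sketch of that mask construction (illustrative only; GLiNER builds this internally):

```python
def block_diagonal_mask(lengths):
    """Return a (total x total) mask: 1 where tokens belong to the same
    packed segment, 0 elsewhere."""
    total = sum(lengths)
    mask = [[0] * total for _ in range(total)]
    offset = 0
    for n in lengths:
        for i in range(offset, offset + n):
            for j in range(offset, offset + n):
                mask[i][j] = 1
        offset += n
    return mask

# Two short requests of 2 and 3 tokens packed into one 5-token pass:
for row in block_diagonal_mask([2, 3]):
    print(row)
# [1, 1, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [0, 0, 1, 1, 1]
# [0, 0, 1, 1, 1]
# [0, 0, 1, 1, 1]
```

Without packing, the same two requests would each be padded to the batch's max length; with packing, the zero blocks replace padding tokens entirely.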
Configure packing once for all predictions:

from gliner import GLiNER, InferencePackingConfig

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1", map_location="cuda")

packing_cfg = InferencePackingConfig(
    max_length=512,
    sep_token_id=model.data_processor.transformer_tokenizer.sep_token_id,
    streams_per_batch=1,
)

# Enable packing for every subsequent `run`/`predict_*` call.
model.configure_inference_packing(packing_cfg)

texts = ["Email CEO to approve budget", "Schedule yearly medical checkup"]
labels = ["person", "organization", "action"]
predictions = model.inference(texts, labels, batch_size=16)
You can override or disable the default configuration on a per-call basis by passing packing_config=<new_cfg> or packing_config=None respectively when invoking model.inference or model.predict_entities.

Benchmark the impact:

The bench/bench_gliner_e2e.py script can stress the full GLiNER pipeline in addition to encoder-only Hugging Face models:

python bench/bench_gliner_e2e.py

To isolate and measure the impact on the encoder:

python bench/bench_infer_packing.py --batch_size 32 --scenario short_zipf
Usage with spaCy
GLiNER can be seamlessly integrated with spaCy. To begin, install the gliner-spacy library via pip:
pip install gliner-spacy
Following installation, you can add GLiNER to a spaCy NLP pipeline. Here's how to integrate it with a blank English pipeline; however, it's compatible with any spaCy model.
import spacy
from gliner_spacy.pipeline import GlinerSpacy
# Configuration for GLiNER integration
custom_spacy_config = {
"gliner_model": "urchade/gliner_mediumv2.1",
"chunk_size": 250,
"labels": ["person", "organization", "email"],
"style": "ent",
"threshold": 0.3,
"map_location": "cpu" # only available in v.0.0.7
}
# Initialize a blank English spaCy pipeline and add GLiNER
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)
# Example text for entity detection
text = "This is a text about Bill Gates and Microsoft."
# Process the text with the pipeline
doc = nlp(text)
# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.score)  # ent._.score only available in v. 0.0.7
Expected Output
Bill Gates => person
Microsoft => organization
Using FlashDeBERTa
Most GLiNER models use the DeBERTa encoder as their backbone. This architecture offers strong token classification performance and typically requires less data to achieve good results. However, a major drawback has been its slower inference speed, and until recently, there was no flash attention implementation compatible with DeBERTa's disentangled attention mechanism.
To address this, FlashDeBERTa was introduced.
Installation
pip install flashdeberta -U
Before using FlashDeBERTa, please make sure that you have transformers>=4.51.3.
Usage
To enable FlashDeBERTa, set the USE_FLASHDEBERTA environment variable before loading the model:
export USE_FLASHDEBERTA=1
Or set it directly in Python:
import os
os.environ["USE_FLASHDEBERTA"] = "1"
from gliner import GLiNER
# FlashDeBERTa will be used when USE_FLASHDEBERTA is set and the package is installed
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
# To explicitly use eager attention instead
model = GLiNER.from_pretrained(
"urchade/gliner_medium-v2.1",
_attn_implementation="eager"
)
Performance Boost: FlashDeBERTa provides up to a 3× speed boost for typical sequence lengths, and even greater improvements for longer sequences.
High-Level Pipelines {#pipelines}
GLiNER-Multitask models are designed to extract relevant information from plain text based on user-provided custom prompts. These encoder-based multitask models enable efficient and controllable information extraction with a single model, reducing computational and storage costs.
Supported Tasks:

- Named Entity Recognition (NER): identify and categorize entities
- Relation Extraction: detect relationships between entities
- Summarization: extract key sentences
- Sentiment Extraction: identify sentiment-bearing text spans
- Key-Phrase Extraction: extract important phrases and keywords
- Question-Answering: find answers to questions in text
- Open Information Extraction: extract information based on open prompts
- Text Classification: classify text against predefined labels
Classification
The GLiNERClassifier pipeline performs text classification tasks:
from gliner import GLiNER
from gliner.multitask import GLiNERClassifier
# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
classifier = GLiNERClassifier(model=model)
# Single-label classification
text = "SpaceX successfully launched a new rocket into orbit."
labels = ['science', 'technology', 'business', 'sports']
predictions = classifier(text, classes=labels, multi_label=False)
print(predictions)
# Output: [[{'label': 'technology', 'score': 0.84}]]
# Multi-label classification
predictions_multi = classifier(text, classes=labels, multi_label=True)
print(predictions_multi)
# Output: [[{'label': 'technology', 'score': 0.84}, {'label': 'science', 'score': 0.72}]]
Evaluation on Dataset:
# Evaluate on HuggingFace dataset
metrics = classifier.evaluate('dair-ai/emotion')
print(metrics)
# Output: {'micro': 0.4465, 'macro': 0.4243, 'weighted': 0.4884}
Question-Answering
The GLiNERQuestionAnswerer pipeline extracts answers from text:
from gliner import GLiNER
from gliner.multitask import GLiNERQuestionAnswerer
# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
answerer = GLiNERQuestionAnswerer(model=model)
# Extract answer
text = "SpaceX was founded by Elon Musk in 2002 to reduce space transportation costs."
question = "Who founded SpaceX?"
predictions = answerer(text, questions=question)
print(predictions)
# Output: [[{'answer': 'Elon Musk', 'score': 0.998}]]
# Multiple questions
questions = ["Who founded SpaceX?", "When was SpaceX founded?", "What is SpaceX's goal?"]
predictions = answerer(text, questions=questions)
for q, pred in zip(questions, predictions):
    print(f"Q: {q}")
    print(f"A: {pred[0]['answer']} (score: {pred[0]['score']:.3f})")
Evaluation on SQuAD:
from gliner.multitask import GLiNERSquadEvaluator
evaluator = GLiNERSquadEvaluator(model_id="knowledgator/gliner-multitask-large-v0.5")
metrics = evaluator.evaluate(threshold=0.25)
print(metrics)
# Output: {'exact': 29.41, 'f1': 29.80, 'total': 11873, ...}
Relation Extraction
The GLiNERRelationExtractor pipeline extracts relationships between entities:
from gliner import GLiNER
from gliner.multitask import GLiNERRelationExtractor
# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
relation_extractor = GLiNERRelationExtractor(model=model)
# Extract relations
text = "Elon Musk founded SpaceX in 2002 to reduce space transportation costs."
entities = ['person', 'company', 'year', 'goal']
relations = ['founded', 'founded_in', 'goal']
predictions = relation_extractor(
text,
entities=entities,
relations=relations,
threshold=0.5
)
for pred in predictions[0]:
    print(f"{pred['source']} --[{pred['relation']}]--> {pred['target']}")
    print(f"  Score: {pred['score']:.3f}")
Expected Output
Elon Musk --[founded]--> SpaceX
Score: 0.958
SpaceX --[founded_in]--> 2002
Score: 0.912
Open Information Extraction
The GLiNEROpenExtractor pipeline extracts information based on custom prompts:
from gliner import GLiNER
from gliner.multitask import GLiNEROpenExtractor
# Initialize with custom prompt
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
extractor = GLiNEROpenExtractor(
model=model,
prompt="Extract all companies related to space technologies"
)
# Extract information
text = """
Elon Musk founded SpaceX in 2002 to reduce space transportation costs.
Also Elon is founder of Tesla, NeuroLink and many other companies.
"""
labels = ['company']
predictions = extractor(text, labels=labels, threshold=0.5)
for pred in predictions[0]:
    print(f"{pred['text']} (score: {pred['score']:.3f})")
Expected Output
SpaceX (score: 0.962)
Tesla (score: 0.936)
NeuroLink (score: 0.912)
Custom Prompts for Different Tasks:
# Extract product descriptions
extractor = GLiNEROpenExtractor(
model=model,
prompt="Extract product descriptions and features from the text"
)
# Extract technical specifications
extractor = GLiNEROpenExtractor(
model=model,
prompt="Extract technical specifications and requirements"
)
# Extract contact information
extractor = GLiNEROpenExtractor(
model=model,
prompt="Extract all contact information including emails and phone numbers"
)
Summarization
The GLiNERSummarizer pipeline extracts key sentences for summarization:
from gliner import GLiNER
from gliner.multitask import GLiNERSummarizer
# Initialize
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
summarizer = GLiNERSummarizer(model=model)
# Extract summary
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop
and sell BASIC interpreters for the Altair 8800. During his career at Microsoft,
Gates held the positions of chairman, chief executive officer, president and chief
software architect, while also being the largest individual shareholder until May 2014.
"""
summary = summarizer(text, threshold=0.1)
print(summary)
Expected Output
['Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop
and sell BASIC interpreters for the Altair 8800.']
Controlling Summary Length:
# More selective (higher threshold = shorter summary)
summary_short = summarizer(text, threshold=0.5)
# More comprehensive (lower threshold = longer summary)
summary_long = summarizer(text, threshold=0.05)
Advanced Relation Extraction with UTCA
For more nuanced control over relation extraction, use the utca framework:
Installation
pip install utca -U
Setting Up the Pipeline
from utca.core import RenameAttribute
from utca.implementation.predictors import GLiNERPredictor, GLiNERPredictorConfig
from utca.implementation.tasks import (
GLiNER,
GLiNERPreprocessor,
GLiNERRelationExtraction,
GLiNERRelationExtractionPreprocessor,
)
# Initialize predictor
predictor = GLiNERPredictor(
GLiNERPredictorConfig(
model_name="knowledgator/gliner-multitask-large-v0.5",
device="cuda:0", # Use "cpu" for CPU inference
)
)
# Create pipeline
pipe = (
GLiNER( # Extract entities
predictor=predictor,
preprocess=GLiNERPreprocessor(threshold=0.7)
)
| RenameAttribute("output", "entities") # Prepare for relation extraction
| GLiNERRelationExtraction( # Extract relations
predictor=predictor,
preprocess=(
GLiNERPreprocessor(threshold=0.5)
| GLiNERRelationExtractionPreprocessor()
)
)
)
Running the Pipeline
text = """
Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975 to develop
and sell BASIC interpreters for the Altair 8800. During his career at Microsoft,
Gates held the positions of chairman, chief executive officer, president and chief
software architect, while also being the largest individual shareholder until May 2014.
"""
result = pipe.run({
"text": text,
"labels": ["organization", "person", "position", "date"],
"relations": [
{
"relation": "founder",
"pairs_filter": [("organization", "person")], # Only consider org-person pairs
"distance_threshold": 100, # Max distance between entities (in characters)
},
{
"relation": "inception_date",
"pairs_filter": [("organization", "date")],
},
{
"relation": "held_position",
"pairs_filter": [("person", "position")],
}
]
})
# Display results
for relation in result["output"]:
    source = relation['source']['span']
    target = relation['target']['span']
    rel_type = relation['relation']
    score = relation['score']
    print(f"{source} --[{rel_type}]--> {target} (score: {score:.3f})")
Expected Output
Microsoft --[founder]--> Bill Gates (score: 0.968)
Microsoft --[founder]--> Paul Allen (score: 0.863)
Microsoft --[inception_date]--> April 4, 1975 (score: 0.997)
Bill Gates --[held_position]--> chairman (score: 0.966)
Bill Gates --[held_position]--> chief executive officer (score: 0.947)
Bill Gates --[held_position]--> president (score: 0.973)
Bill Gates --[held_position]--> chief software architect (score: 0.950)
Advanced UTCA Features
Distance Filtering:
# Only extract relations where entities are close together
relations = [
{
"relation": "works_for",
"pairs_filter": [("person", "organization")],
"distance_threshold": 50, # Entities must be within 50 characters
}
]
Multiple Relation Types:
# Define complex relation schemas
relations = [
{
"relation": "employed_by",
"pairs_filter": [("person", "organization")],
},
{
"relation": "located_in",
"pairs_filter": [("organization", "location"), ("person", "location")],
},
{
"relation": "acquired_by",
"pairs_filter": [("organization", "organization")],
},
]
Practical Examples
Example 1: Extract Company Information
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
text = """
Apple Inc. is headquartered in Cupertino, California. The company was founded
by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. Tim Cook is the
current CEO. Apple's main products include iPhone, iPad, and Mac computers.
"""
labels = ["company", "location", "person", "position", "product", "date"]
entities = model.predict_entities(text, labels, threshold=0.5)
# Organize by type
from collections import defaultdict
by_type = defaultdict(list)
for entity in entities:
    by_type[entity['label']].append(entity['text'])

for label, items in by_type.items():
    print(f"{label}: {', '.join(set(items))}")
Example 2: Process Scientific Papers
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
abstract = """
We introduce GPT-4, a large-scale multimodal model developed by OpenAI.
The model was trained on a diverse dataset and exhibits strong performance
on various benchmarks including MMLU, HumanEval, and GSM-8K.
"""
labels = [
"model_name", "organization", "dataset", "benchmark",
"metric", "task", "method"
]
entities = model.predict_entities(abstract, labels, threshold=0.4)
print("Extracted Information:")
for entity in entities:
    print(f"  {entity['label']}: {entity['text']}")
Example 3: Analyze News Articles
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-bi-small-v1.0")
article = """
Tesla CEO Elon Musk announced on Twitter that the company will open a new
Gigafactory in Austin, Texas. The facility will produce the Cybertruck and
Model Y vehicles. Construction began in July 2020 and operations started in 2021.
"""
labels = [
"person", "position", "company", "location", "facility",
"product", "date", "event"
]
# Process with BiEncoder for efficiency
entities = model.predict_entities(article, labels, threshold=0.5)
# Group related entities
print("Key Information:")
print(f"- Company: {[e['text'] for e in entities if e['label'] == 'company']}")
print(f"- Location: {[e['text'] for e in entities if e['label'] == 'location']}")
print(f"- Products: {[e['text'] for e in entities if e['label'] == 'product']}")
print(f"- Timeline: {[e['text'] for e in entities if e['label'] == 'date']}")
Prompt Compression (Precomputed Prompt Embeddings)ΒΆ
For uni-encoder models (span, token, and relation-extraction variants) you can
precompute the prompt embeddings for a fixed label set and reuse them at
inference time. In precomputed mode the encoder receives only the text
(no <<ENT>>label1<<ENT>>...<<SEP>> prefix), which shortens the input sequence,
reduces attention cost, and can noticeably speed up inference, at a small
accuracy trade-off versus re-encoding the prompts on every call.
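To make the sequence-length saving concrete, the per-call prompt prefix that precomputed mode eliminates can be sketched as a string in the `<<ENT>>label1<<ENT>>...<<SEP>>` format named above (the `build_prompted_input` helper is illustrative, not the library's actual tokenization code):

```python
# Illustrative: the per-call prompt prefix that precomputed mode removes.
# Format follows the <<ENT>>label1<<ENT>>...<<SEP>> pattern described above.
def build_prompted_input(labels, text):
    prefix = "".join(f"<<ENT>>{label}" for label in labels) + "<<SEP>>"
    return prefix + text

labels = ["person", "organization", "location", "date"]
full = build_prompted_input(labels, "Tim Cook visited Berlin.")
print(full)
# In precomputed mode the encoder sees only the text, so this prefix
# (and its tokens) is saved on every call; the saving grows with the
# number and length of labels.
print(len(full) - len("Tim Cook visited Berlin."))
```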
How it worksΒΆ
BaseGLiNER.compress_prompt_embeddings(texts, labels, rel_labels=None, batch_size=8, distill=False, distill_threshold=0.3, distill_epochs=3, distill_lr=1e-5, distill_batch_size=None, distill_output_dir="./distill_ckpt", distill_train_kwargs=None):
1. Runs the normal forward pass over (texts, labels) pairs.
2. Extracts the per-label prompt embedding (the <<ENT>> token representation, pre-projection) from each example.
3. Averages across all examples to produce an (L, D) matrix stored as a non-trainable parameter on the underlying model (model.precomputed_prompts).
4. Sets config.precomputed_prompts_mode = True and writes config.id_to_classes, so subsequent predict_entities / forward calls skip prompt-prepending and look up the stored embeddings instead.
The stored embeddings travel with state_dict, so save_pretrained /
from_pretrained round-trip them automatically. Training can continue after
compression: the stored matrix is frozen, but everything else keeps training.
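The averaging in step 3 can be pictured with a pure-Python stand-in: per-example, per-label prompt embeddings (E examples x L labels x D dims) collapse to a single (L, D) matrix. Toy dimensions only; this is not the model's internal code:

```python
# Toy sketch of the averaging step inside compress_prompt_embeddings:
# E x L x D nested lists reduce to one L x D matrix by averaging over
# the example axis.
def average_prompt_embeddings(per_example):
    n_examples = len(per_example)
    n_labels = len(per_example[0])
    dim = len(per_example[0][0])
    return [
        [sum(ex[l][d] for ex in per_example) / n_examples for d in range(dim)]
        for l in range(n_labels)
    ]

per_example = [
    [[1.0, 2.0], [3.0, 4.0]],  # example 1: L=2 labels, D=2
    [[3.0, 0.0], [1.0, 0.0]],  # example 2
]
print(average_prompt_embeddings(per_example))  # [[2.0, 1.0], [2.0, 2.0]]
```

Averaging over many diverse contexts is what makes the stored embedding label-specific rather than example-specific, which is why the calibration corpus should be varied.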
Basic usage (entity extraction)ΒΆ
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
# Representative texts from your target domain. They do not need labels;
# they are only used as contexts while averaging the prompt representations.
calibration_texts = [
"Barack Obama was born in Honolulu, Hawaii.",
"Apple announced a new iPhone at their Cupertino headquarters.",
# ... ideally 100β1000 diverse sentences from your domain
]
labels = ["person", "organization", "location", "date"]
# One-time compression step
model.compress_prompt_embeddings(calibration_texts, labels, batch_size=16)
# Inference now uses the precomputed prompts; labels are no longer re-encoded
entities = model.predict_entities(
"Tim Cook visited Berlin last Tuesday.",
labels, # must match (order-insensitive) the compressed set
threshold=0.5,
)
# Persist the compressed model
model.save_pretrained("./gliner-compressed")
Relation extractionΒΆ
For relex models (UniEncoderSpanRelexModel / UniEncoderTokenRelexModel),
pass rel_labels so the <<REL>> prompt embeddings are compressed as well:
model.compress_prompt_embeddings(
texts=calibration_texts,
labels=["person", "organization", "location"],
rel_labels=["works_for", "located_in", "founder_of"],
batch_size=8,
)
End-to-end distillationΒΆ
Compression alone can dip quality because averaged prompt embeddings drop
context-specific signal. Pass distill=True to recover it in a single call:
the raw (pre-compression) model first generates pseudo-labels over texts,
prompts are then compressed, and the compressed model is fine-tuned on those
pseudo-labels; no separate script required.
model.compress_prompt_embeddings(
texts=calibration_texts, # also used as the distillation corpus
labels=labels,
batch_size=16,
distill=True,
distill_threshold=0.3, # pseudo-label confidence cutoff
distill_epochs=3,
distill_lr=1e-5,
distill_output_dir="./distill_ckpt",
)
Relevant knobs:
distill_threshold: confidence cutoff used when the raw model produces pseudo-labels. Lower values widen the training signal but add noise.
distill_epochs, distill_lr: fine-tuning schedule.
distill_batch_size: defaults to batch_size if omitted.
distill_output_dir: forwarded to train_model.
distill_train_kwargs: dict of extra kwargs merged into the underlying train_model call (e.g. to override save_strategy, logging_steps, etc.).
Pseudo-labels are generated from the same texts used for compression, so one
diverse in-domain corpus serves both roles.
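The pseudo-labeling step can be pictured as filtering raw-model predictions by distill_threshold before they become training examples. The sketch below is a plain-Python illustration; the record format and `build_pseudo_labels` helper are hypothetical, not GLiNER's internal schema (the prediction dicts mirror the `predict_entities` output):

```python
# Sketch of pseudo-label construction: keep only predictions whose
# confidence clears distill_threshold, then package them as training
# records. Illustrative format only.
def build_pseudo_labels(texts, predictions_per_text, distill_threshold=0.3):
    dataset = []
    for text, preds in zip(texts, predictions_per_text):
        spans = [(p["start"], p["end"], p["label"])
                 for p in preds if p["score"] >= distill_threshold]
        dataset.append({"text": text, "spans": spans})
    return dataset

preds = [[
    {"start": 0, "end": 8, "label": "person", "score": 0.91},
    {"start": 17, "end": 23, "label": "location", "score": 0.12},
]]
print(build_pseudo_labels(["Tim Cook visited Berlin."], preds))
# The 0.12-confidence span is dropped; lowering distill_threshold keeps
# more spans, widening the training signal at the cost of noise.
```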
Tips and Best PracticesΒΆ
Choose the right model architecture:
UniEncoder: General purpose, < 30 entity types
BiEncoder: Many entity types (50-200+)
Token-level: Long entity spans
Relation extraction: Knowledge graph construction
Optimize threshold for your use case:
High precision: threshold = 0.6-0.8
Balanced: threshold = 0.4-0.6
High recall: threshold = 0.2-0.4
Use batch processing for multiple documents:
More efficient GPU utilization
Faster overall processing
Pre-compute label embeddings (BiEncoder):
Cache embeddings when processing many documents
Significant speedup for production use
Enable FlashDeBERTa:
~3x speed improvement
No accuracy loss
Use appropriate labels:
Specific labels work better than generic ones
"company" > "entity"
"medication" > "word"
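Because each prediction carries a score, the threshold guidance above can also be explored post-hoc: run the model once with a low threshold, then filter by score to compare operating points without re-invoking it. A small sketch (the entity dicts mirror the `predict_entities` output format):

```python
# Filter one low-threshold run at different score cutoffs instead of
# re-running the model per threshold.
def filter_by_threshold(entities, threshold):
    return [e for e in entities if e["score"] >= threshold]

entities = [
    {"text": "Al Nassr", "label": "teams", "score": 0.88},
    {"text": "forward", "label": "position", "score": 0.35},
]
print(filter_by_threshold(entities, 0.7))  # high precision: strong hits only
print(filter_by_threshold(entities, 0.3))  # high recall: keeps weak hits too
```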
TroubleshootingΒΆ
Low AccuracyΒΆ
# Try lowering the threshold
entities = model.predict_entities(text, labels, threshold=0.3)
# Use more specific labels
labels = ["tech_company", "software_product", "founder"] # Specific
# instead of
labels = ["organization", "thing", "person"] # Too generic
# Try a larger model
model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")
Slow InferenceΒΆ
# Enable FlashDeBERTa
# pip install flashdeberta
# Compile model
model = GLiNER.from_pretrained(
"urchade/gliner_small-v2.1",
compile_torch_model=True
)
# Use batch processing
entities_batch = model.inference(texts, labels, batch_size=16)
# For BiEncoder: pre-compute embeddings
label_embeds = model.encode_labels(labels)
entities = model.predict_with_embeds(text, label_embeds, labels)
Out of MemoryΒΆ
# Reduce batch size
entities = model.inference(texts, labels, batch_size=4)
# Use a smaller model
model = GLiNER.from_pretrained("urchade/gliner_small-v2.1")
# Process on CPU
model = GLiNER.from_pretrained(
"urchade/gliner_small-v2.1",
map_location="cpu"
)