gliner.data_processing.tokenizer module¶
Token splitter implementations for various languages and tokenization methods.
This module provides multiple token splitter classes for different languages and tokenization strategies, including whitespace-based, language-specific, and universal multi-language splitters.
- class gliner.data_processing.tokenizer.TokenSplitterBase[source]¶
Bases: object
Base class for token splitters.
This class provides the interface for all token splitter implementations. Subclasses should implement the __call__ method to yield tokens with their start and end positions.
Initialize the token splitter.
- class gliner.data_processing.tokenizer.WhitespaceTokenSplitter[source]¶
Bases: TokenSplitterBase
Whitespace-based token splitter.
Splits text based on whitespace boundaries, treating words and symbols as separate tokens. Supports hyphenated and underscored words.
Initialize the whitespace token splitter with regex pattern.
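The documented behavior (words and symbols treated as separate tokens, with hyphenated and underscored words kept whole, each token carrying its start and end position) can be sketched in pure Python. The regex pattern below is an assumption chosen to match the description, not necessarily the library's exact pattern:

```python
import re

# Hypothetical pattern: runs of word characters, optionally joined by
# hyphens or underscores, or any single non-space symbol.
_TOKEN_RE = re.compile(r"\w+(?:[-_]\w+)*|[^\w\s]")

def whitespace_tokens(text):
    """Yield (token, start, end) triples, mirroring the splitter interface."""
    for match in _TOKEN_RE.finditer(text):
        yield match.group(), match.start(), match.end()

tokens = list(whitespace_tokens("state-of-the-art NER, fast!"))
# e.g. [('state-of-the-art', 0, 16), ('NER', 17, 20), (',', 20, 21), ...]
```

Note that whitespace never appears in the output; it only delimits tokens, so character offsets can skip over it.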
- class gliner.data_processing.tokenizer.SpaCyTokenSplitter(lang=None)[source]¶
Bases: TokenSplitterBase
spaCy-based token splitter.
Uses spaCy’s language models for tokenization. Supports multiple languages through spaCy’s blank language models.
Initialize the spaCy token splitter.
- Parameters:
lang – Language code for spaCy's blank language model; if None, defaults to ‘en’ (English).
- Raises:
ModuleNotFoundError – If spaCy is not installed.
- class gliner.data_processing.tokenizer.MecabKoTokenSplitter[source]¶
Bases: TokenSplitterBase
MeCab Korean token splitter.
Uses python-mecab-ko for Korean language tokenization based on morphological analysis.
Initialize the MeCab Korean token splitter.
- Raises:
ModuleNotFoundError – If python-mecab-ko is not installed.
- class gliner.data_processing.tokenizer.JanomeJaTokenSplitter[source]¶
Bases: TokenSplitterBase
Janome Japanese token splitter.
Uses Janome for Japanese language tokenization with morphological analysis.
Initialize the Janome Japanese token splitter.
- Raises:
ModuleNotFoundError – If janome is not installed.
- class gliner.data_processing.tokenizer.JiebaTokenSplitter[source]¶
Bases: TokenSplitterBase
Jieba Chinese token splitter.
Uses Jieba for Chinese language segmentation and tokenization.
Initialize the Jieba Chinese token splitter.
- Raises:
ModuleNotFoundError – If jieba is not installed.
- class gliner.data_processing.tokenizer.CamelArabicSplitter[source]¶
Bases: object
CAMeL Tools Arabic token splitter.
Uses CAMeL Tools for Arabic language tokenization with support for Arabic-specific linguistic features.
Initialize the CAMeL Tools Arabic token splitter.
- Raises:
ModuleNotFoundError – If camel_tools is not installed.
- class gliner.data_processing.tokenizer.HindiSplitter[source]¶
Bases: object
Indic NLP Hindi token splitter.
Uses Indic NLP Library for Hindi language tokenization with support for Devanagari script.
Initialize the Hindi token splitter.
- Raises:
ModuleNotFoundError – If indicnlp is not installed.
- class gliner.data_processing.tokenizer.HanLPTokenSplitter(model_name='FINE_ELECTRA_SMALL_ZH')[source]¶
Bases: TokenSplitterBase
HanLP Chinese token splitter.
Uses HanLP for Chinese language tokenization with support for multiple pre-trained models.
Initialize the HanLP token splitter.
- Parameters:
model_name – Name of the HanLP pre-trained model to use (default: ‘FINE_ELECTRA_SMALL_ZH’).
- Raises:
ModuleNotFoundError – If hanlp is not installed.
ValueError – If the specified model name is not available.
- __init__(model_name='FINE_ELECTRA_SMALL_ZH')[source]¶
Initialize the HanLP token splitter.
- Parameters:
model_name – Name of the HanLP pre-trained model to use (default: ‘FINE_ELECTRA_SMALL_ZH’).
- Raises:
ModuleNotFoundError – If hanlp is not installed.
ValueError – If the specified model name is not available.
- class gliner.data_processing.tokenizer.MultiLangWordsSplitter(logging=False, use_spacy=True)[source]¶
Bases: TokenSplitterBase
Multi-language token splitter with automatic language detection.
Automatically detects the input language and applies the appropriate language-specific tokenizer. Falls back to a universal splitter for unsupported languages.
Initialize the multi-language token splitter.
- Parameters:
logging – Whether to print language detection information (default: False).
use_spacy – Whether to use spaCy as the universal fallback splitter. If False, uses whitespace-based splitting (default: True).
- Raises:
ImportError – If langdetect is not installed.
- __init__(logging=False, use_spacy=True)[source]¶
Initialize the multi-language token splitter.
- Parameters:
logging – Whether to print language detection information (default: False).
use_spacy – Whether to use spaCy as the universal fallback splitter. If False, uses whitespace-based splitting (default: True).
- Raises:
ImportError – If langdetect is not installed.
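The detect-then-dispatch pattern described above can be sketched as follows. The detector and the per-language splitters are stand-ins for illustration; the real class relies on langdetect and the language-specific splitter classes documented on this page:

```python
def detect_lang(text):
    """Stub detector: the real class would call langdetect.detect(text)."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):  # CJK ideograph range
        return "zh"
    return "en"

class MultiLangSketch:
    """Route text to a language-specific splitter, or fall back."""

    def __init__(self, splitters, fallback, logging=False):
        self.splitters = splitters   # e.g. {"zh": jieba_split, "ko": mecab_split}
        self.fallback = fallback     # universal splitter for unsupported languages
        self.logging = logging

    def __call__(self, text):
        lang = detect_lang(text)
        splitter = self.splitters.get(lang, self.fallback)
        if self.logging:
            print(f"detected {lang!r}")
        return splitter(text)

# Toy splitters standing in for the real backends:
def char_split(text):
    return [c for c in text if not c.isspace()]

splitter = MultiLangSketch({"zh": char_split}, fallback=str.split)
```

With this wiring, Chinese input is routed to the character-level stub while anything else falls back to whitespace splitting, mirroring the fallback behavior described above.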
- class gliner.data_processing.tokenizer.StanzaWordsSplitter(default_lang='en', download_on_missing=True, logging=False)[source]¶
Bases: TokenSplitterBase
Stanza-based multi-language token splitter.
Uses Stanford’s Stanza NLP library for tokenization with support for multiple languages. Automatically downloads language models when needed and falls back to a default language if detection fails.
Initialize the Stanza token splitter.
- Parameters:
default_lang (str) – Default language code to use if detection fails (default: ‘en’).
download_on_missing (bool) – Whether to automatically download missing language models (default: True).
logging (bool) – Whether to print download and processing information (default: False).
- Raises:
ModuleNotFoundError – If stanza or langdetect is not installed.
- __init__(default_lang='en', download_on_missing=True, logging=False)[source]¶
Initialize the Stanza token splitter.
- Parameters:
default_lang (str) – Default language code to use if detection fails (default: ‘en’).
download_on_missing (bool) – Whether to automatically download missing language models (default: True).
logging (bool) – Whether to print download and processing information (default: False).
- Raises:
ModuleNotFoundError – If stanza or langdetect is not installed.
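The detect → load-or-download → fall-back flow described above can be sketched with stubs; `stanza.Pipeline`, model downloading, and langdetect are replaced by placeholders so only the control flow remains:

```python
class StanzaSketch:
    """Sketch of the model-loading flow: cache, download on demand,
    and fall back to default_lang when a language is unsupported."""

    def __init__(self, default_lang="en", download_on_missing=True):
        self.default_lang = default_lang
        self.download_on_missing = download_on_missing
        self.pipelines = {}  # cache of loaded language models

    def _load(self, lang):
        # Stand-in for stanza.download(lang) + stanza.Pipeline(lang).
        if lang not in ("en", "de"):
            raise ValueError(f"no model for {lang!r}")
        return str.split

    def get_pipeline(self, lang):
        if lang not in self.pipelines and not self.download_on_missing:
            lang = self.default_lang      # never download: use default model
        if lang in self.pipelines:
            return self.pipelines[lang]
        try:
            pipe = self._load(lang)       # may trigger a download
        except ValueError:
            lang = self.default_lang      # unsupported language: fall back
            pipe = self._load(lang)
        self.pipelines[lang] = pipe
        return pipe
```

Requesting an unsupported language falls back to the default model rather than raising, which matches the documented behavior when detection fails.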
- class gliner.data_processing.tokenizer.WordsSplitter(splitter_type='whitespace')[source]¶
Bases: TokenSplitterBase
Universal token splitter with multiple backend options.
Factory class that creates the appropriate token splitter based on the specified splitter type. Supports various language-specific and universal tokenization strategies.
Initialize the words splitter with the specified backend.
- Parameters:
splitter_type – Type of splitter to use. Options are:
- ‘universal’: Multi-language with auto-detection
- ‘whitespace’: Simple whitespace-based splitting
- ‘spacy’: spaCy-based tokenization
- ‘mecab’: MeCab for Korean
- ‘jieba’: Jieba for Chinese
- ‘hanlp’: HanLP for Chinese
- ‘janome’: Janome for Japanese
- ‘camel’: CAMeL Tools for Arabic
- ‘hindi’: Indic NLP for Hindi
- ‘stanza’: Stanza multi-language tokenization
Default is ‘whitespace’.
- Raises:
ValueError – If the specified splitter_type is not implemented.
- __init__(splitter_type='whitespace')[source]¶
Initialize the words splitter with the specified backend.
- Parameters:
splitter_type – Type of splitter to use. Options are:
- ‘universal’: Multi-language with auto-detection
- ‘whitespace’: Simple whitespace-based splitting
- ‘spacy’: spaCy-based tokenization
- ‘mecab’: MeCab for Korean
- ‘jieba’: Jieba for Chinese
- ‘hanlp’: HanLP for Chinese
- ‘janome’: Janome for Japanese
- ‘camel’: CAMeL Tools for Arabic
- ‘hindi’: Indic NLP for Hindi
- ‘stanza’: Stanza multi-language tokenization
Default is ‘whitespace’.
- Raises:
ValueError – If the specified splitter_type is not implemented.
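The factory dispatch can be sketched with a name-to-class mapping; the two backend classes below are simplified placeholders for the splitters documented above, and the mapping is an assumption about how the dispatch might be organized:

```python
class WhitespaceStub:          # placeholder for WhitespaceTokenSplitter
    def __call__(self, text):
        return text.split()

class UniversalStub:           # placeholder for MultiLangWordsSplitter
    def __call__(self, text):
        return text.split()

_SPLITTERS = {
    "whitespace": WhitespaceStub,
    "universal": UniversalStub,
    # ... 'spacy', 'mecab', 'jieba', 'hanlp', 'janome', 'camel', 'hindi', 'stanza'
}

class WordsSplitterSketch:
    """Factory: instantiate the backend named by splitter_type."""

    def __init__(self, splitter_type="whitespace"):
        if splitter_type not in _SPLITTERS:
            raise ValueError(f"splitter {splitter_type!r} is not implemented")
        self.splitter = _SPLITTERS[splitter_type]()

    def __call__(self, text):
        return self.splitter(text)
```

A factory like this keeps the optional backends (spaCy, MeCab, Jieba, etc.) unimported until actually requested, which is why the per-backend `ModuleNotFoundError`s above are raised lazily at construction time.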