gliner.data_processing.tokenizer module¶

Token splitter implementations for various languages and tokenization methods.

This module provides multiple token splitter classes for different languages and tokenization strategies, including whitespace-based, language-specific, and universal multi-language splitters.

class gliner.data_processing.tokenizer.TokenSplitterBase[source]¶

Bases: object

Base class for token splitters.

This class provides the interface for all token splitter implementations. Subclasses should implement the __call__ method to yield tokens with their start and end positions.

Initialize the token splitter.

__init__()[source]¶

Initialize the token splitter.
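
The contract described above can be illustrated with a minimal standalone subclass. This is a sketch only: the class and pattern below are illustrative, not taken from the module, but they show the assumed `(token, start, end)` shape of what `__call__` yields.

```python
import re


class TokenSplitterBase:
    """Interface sketch: subclasses yield (token, start, end) triples."""

    def __call__(self, text):
        raise NotImplementedError


class SimpleSplitter(TokenSplitterBase):
    """Hypothetical subclass splitting on runs of non-space characters."""

    def __init__(self):
        self._pattern = re.compile(r"\S+")

    def __call__(self, text):
        # Yield each token together with its character offsets
        # into the original string.
        for match in self._pattern.finditer(text):
            yield match.group(), match.start(), match.end()


tokens = list(SimpleSplitter()("Hello world"))
# → [("Hello", 0, 5), ("world", 6, 11)]
```

Because positions are character offsets into the original text, downstream code can map spans back onto the input without re-tokenizing.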

class gliner.data_processing.tokenizer.WhitespaceTokenSplitter[source]¶

Bases: TokenSplitterBase

Whitespace-based token splitter.

Splits text based on whitespace boundaries, treating words and symbols as separate tokens. Supports hyphenated and underscored words.

Initialize the whitespace token splitter with regex pattern.

__init__()[source]¶

Initialize the whitespace token splitter with regex pattern.
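
A rough stand-in for the behavior described above (the library's actual regex may differ): words, including hyphenated and underscored ones, come out as single tokens, while other symbols are split off on their own.

```python
import re

# Assumed approximation of the splitter's pattern: a word optionally
# extended by "-part" or "_part" segments, otherwise any single
# non-space character.
PATTERN = re.compile(r"\w+(?:[-_]\w+)*|\S")


def whitespace_tokens(text):
    for m in PATTERN.finditer(text):
        yield m.group(), m.start(), m.end()


print(list(whitespace_tokens("pre_trained state-of-the-art models!")))
```

Note how `pre_trained` and `state-of-the-art` each survive as one token, while the trailing `!` becomes a separate token.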

class gliner.data_processing.tokenizer.SpaCyTokenSplitter(lang=None)[source]¶

Bases: TokenSplitterBase

spaCy-based token splitter.

Uses spaCy’s language models for tokenization. Supports multiple languages through spaCy’s blank language models.

Initialize the spaCy token splitter.

Parameters:

lang – Language code for spaCy’s blank model; if None, falls back to ‘en’ (English).

Raises:

ModuleNotFoundError – If spaCy is not installed.

__init__(lang=None)[source]¶

Initialize the spaCy token splitter.

Parameters:

lang – Language code for spaCy’s blank model; if None, falls back to ‘en’ (English).

Raises:

ModuleNotFoundError – If spaCy is not installed.
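
This splitter and the language-specific ones below all defer their backend imports and raise ModuleNotFoundError when the dependency is missing. A generic sketch of that lazy-import pattern (the helper name here is hypothetical, not part of the module):

```python
import importlib


def require_backend(module_name, pip_name):
    """Import an optional backend, or fail with an installation hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ModuleNotFoundError(
            f"{module_name} is required for this splitter; "
            f"install it with: pip install {pip_name}"
        ) from exc


# require_backend("spacy", "spacy") would return the spacy module when
# it is installed, and raise ModuleNotFoundError with a hint otherwise.
```

Deferring the import this way keeps the module importable even when only one of the many optional backends is installed.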

class gliner.data_processing.tokenizer.MecabKoTokenSplitter[source]¶

Bases: TokenSplitterBase

MeCab Korean token splitter.

Uses python-mecab-ko for Korean language tokenization based on morphological analysis.

Initialize the MeCab Korean token splitter.

Raises:

ModuleNotFoundError – If python-mecab-ko is not installed.

__init__()[source]¶

Initialize the MeCab Korean token splitter.

Raises:

ModuleNotFoundError – If python-mecab-ko is not installed.

class gliner.data_processing.tokenizer.JanomeJaTokenSplitter[source]¶

Bases: TokenSplitterBase

Janome Japanese token splitter.

Uses Janome for Japanese language tokenization with morphological analysis.

Initialize the Janome Japanese token splitter.

Raises:

ModuleNotFoundError – If janome is not installed.

__init__()[source]¶

Initialize the Janome Japanese token splitter.

Raises:

ModuleNotFoundError – If janome is not installed.

class gliner.data_processing.tokenizer.JiebaTokenSplitter[source]¶

Bases: TokenSplitterBase

Jieba Chinese token splitter.

Uses Jieba for Chinese language segmentation and tokenization.

Initialize the Jieba Chinese token splitter.

Raises:

ModuleNotFoundError – If jieba is not installed.

__init__()[source]¶

Initialize the Jieba Chinese token splitter.

Raises:

ModuleNotFoundError – If jieba is not installed.

class gliner.data_processing.tokenizer.CamelArabicSplitter[source]¶

Bases: object

CAMeL Tools Arabic token splitter.

Uses CAMeL Tools for Arabic language tokenization with support for Arabic-specific linguistic features.

Initialize the CAMeL Tools Arabic token splitter.

Raises:

ModuleNotFoundError – If camel_tools is not installed.

__init__()[source]¶

Initialize the CAMeL Tools Arabic token splitter.

Raises:

ModuleNotFoundError – If camel_tools is not installed.

class gliner.data_processing.tokenizer.HindiSplitter[source]¶

Bases: object

Indic NLP Hindi token splitter.

Uses Indic NLP Library for Hindi language tokenization with support for Devanagari script.

Initialize the Hindi token splitter.

Raises:

ModuleNotFoundError – If indicnlp is not installed.

__init__()[source]¶

Initialize the Hindi token splitter.

Raises:

ModuleNotFoundError – If indicnlp is not installed.

class gliner.data_processing.tokenizer.HanLPTokenSplitter(model_name='FINE_ELECTRA_SMALL_ZH')[source]¶

Bases: TokenSplitterBase

HanLP Chinese token splitter.

Uses HanLP for Chinese language tokenization with support for multiple pre-trained models.

Initialize the HanLP token splitter.

Parameters:

model_name – Name of the HanLP pre-trained model to use (default: ‘FINE_ELECTRA_SMALL_ZH’).

Raises:
  • ModuleNotFoundError – If hanlp is not installed.

  • ValueError – If the specified model name is not available.

__init__(model_name='FINE_ELECTRA_SMALL_ZH')[source]¶

Initialize the HanLP token splitter.

Parameters:

model_name – Name of the HanLP pre-trained model to use (default: ‘FINE_ELECTRA_SMALL_ZH’).

Raises:
  • ModuleNotFoundError – If hanlp is not installed.

  • ValueError – If the specified model name is not available.

class gliner.data_processing.tokenizer.MultiLangWordsSplitter(logging=False, use_spacy=True)[source]¶

Bases: TokenSplitterBase

Multi-language token splitter with automatic language detection.

Automatically detects the input language and applies the appropriate language-specific tokenizer. Falls back to a universal splitter for unsupported languages.

Initialize the multi-language token splitter.

Parameters:
  • logging – Whether to print language detection information (default: False).

  • use_spacy – Whether to use spaCy as the universal fallback splitter. If False, uses whitespace-based splitting (default: True).

Raises:

ImportError – If langdetect is not installed.

__init__(logging=False, use_spacy=True)[source]¶

Initialize the multi-language token splitter.

Parameters:
  • logging – Whether to print language detection information (default: False).

  • use_spacy – Whether to use spaCy as the universal fallback splitter. If False, uses whitespace-based splitting (default: True).

Raises:

ImportError – If langdetect is not installed.
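
The detect-then-dispatch logic can be sketched as follows. In the real class, langdetect supplies the detection step; here the detector is injected as a callable so the sketch stays dependency-free, and the splitter registry is purely illustrative.

```python
def make_multilang_splitter(detect, splitters, fallback):
    """Pick a language-specific splitter per input, else fall back.

    detect    -- callable mapping text -> language code
    splitters -- dict of language code -> splitter callable
    fallback  -- splitter used for unsupported/undetected languages
    """
    def split(text):
        try:
            lang = detect(text)
        except Exception:
            lang = None  # detection failed: use the fallback
        splitter = splitters.get(lang, fallback)
        return list(splitter(text))
    return split


# Toy demonstration with a fake detector and whitespace splitters.
def fake_detect(text):
    # Treat any Hangul syllable as a signal for Korean.
    return "ko" if any("\uac00" <= ch <= "\ud7a3" for ch in text) else "en"


split = make_multilang_splitter(
    fake_detect,
    {"ko": lambda t: t.split()},   # stand-in for MecabKoTokenSplitter
    lambda t: t.lower().split(),   # stand-in universal fallback
)
```

Korean input is routed to the Korean stand-in, while anything else falls through to the universal splitter, mirroring the fallback behavior described above.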

class gliner.data_processing.tokenizer.StanzaWordsSplitter(default_lang='en', download_on_missing=True, logging=False)[source]¶

Bases: TokenSplitterBase

Stanza-based multi-language token splitter.

Uses Stanford’s Stanza NLP library for tokenization with support for multiple languages. Automatically downloads language models when needed and falls back to a default language if detection fails.

Initialize the Stanza token splitter.

Parameters:
  • default_lang (str) – Default language code to use if detection fails (default: ‘en’).

  • download_on_missing (bool) – Whether to automatically download missing language models (default: True).

  • logging (bool) – Whether to print download and processing information (default: False).

Raises:

ModuleNotFoundError – If stanza or langdetect is not installed.

__init__(default_lang='en', download_on_missing=True, logging=False)[source]¶

Initialize the Stanza token splitter.

Parameters:
  • default_lang (str) – Default language code to use if detection fails (default: ‘en’).

  • download_on_missing (bool) – Whether to automatically download missing language models (default: True).

  • logging (bool) – Whether to print download and processing information (default: False).

Raises:

ModuleNotFoundError – If stanza or langdetect is not installed.

class gliner.data_processing.tokenizer.WordsSplitter(splitter_type='whitespace')[source]¶

Bases: TokenSplitterBase

Universal token splitter with multiple backend options.

Factory class that creates the appropriate token splitter based on the specified splitter type. Supports various language-specific and universal tokenization strategies.

Initialize the words splitter with the specified backend.

Parameters:

splitter_type – Type of splitter to use (default: ‘whitespace’). Options:
  • ‘universal’: Multi-language with auto-detection
  • ‘whitespace’: Simple whitespace-based splitting
  • ‘spacy’: spaCy-based tokenization
  • ‘mecab’: MeCab for Korean
  • ‘jieba’: Jieba for Chinese
  • ‘hanlp’: HanLP for Chinese
  • ‘janome’: Janome for Japanese
  • ‘camel’: CAMeL Tools for Arabic
  • ‘hindi’: Indic NLP for Hindi
  • ‘stanza’: Stanza multi-language tokenization

Raises:

ValueError – If the specified splitter_type is not implemented.

__init__(splitter_type='whitespace')[source]¶

Initialize the words splitter with the specified backend.

Parameters:

splitter_type – Type of splitter to use (default: ‘whitespace’). Options:
  • ‘universal’: Multi-language with auto-detection
  • ‘whitespace’: Simple whitespace-based splitting
  • ‘spacy’: spaCy-based tokenization
  • ‘mecab’: MeCab for Korean
  • ‘jieba’: Jieba for Chinese
  • ‘hanlp’: HanLP for Chinese
  • ‘janome’: Janome for Japanese
  • ‘camel’: CAMeL Tools for Arabic
  • ‘hindi’: Indic NLP for Hindi
  • ‘stanza’: Stanza multi-language tokenization

Raises:

ValueError – If the specified splitter_type is not implemented.
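
The factory behavior described above can be sketched as a registry lookup. The registry contents here are illustrative stand-ins, but the shape — map a type name to a constructor, raise ValueError for unknown names — mirrors the documented behavior of WordsSplitter.

```python
def make_splitter(splitter_type="whitespace", registry=None):
    """Resolve a splitter type name to a constructed splitter.

    registry maps type names to zero-argument factories; unknown
    names raise ValueError, as WordsSplitter is documented to do.
    """
    default_registry = {
        # Stand-in for WhitespaceTokenSplitter: plain str.split.
        "whitespace": lambda: str.split,
    }
    registry = registry or default_registry
    if splitter_type not in registry:
        raise ValueError(f"{splitter_type!r} is not implemented")
    return registry[splitter_type]()


splitter = make_splitter()          # default: whitespace backend
print(splitter("a b c"))
```

Centralizing the choice in one factory keeps call sites backend-agnostic: swapping ‘whitespace’ for ‘spacy’ or ‘stanza’ changes nothing downstream.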