gliner.training.trainer module

Custom Trainer implementation with enhanced loss functions and optimizer configuration.

This module extends the Hugging Face Transformers Trainer class to support custom loss functions (focal loss, label smoothing), flexible learning rates for different parameter groups, and robust error handling during training.

gliner.training.trainer.seed_worker(_)[source]

Set worker seed during DataLoader initialization.

Helper function to ensure reproducibility by seeding each DataLoader worker process with a unique but deterministic seed based on PyTorch’s initial seed.

Parameters:

_ – Worker ID (unused, but required by DataLoader worker_init_fn signature).
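The seeding idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's code: `seed_worker_sketch` is a hypothetical name, the base seed is passed in explicitly (the real helper reads it from PyTorch inside the worker process), and folding the seed into 32 bits is an assumption borrowed from common PyTorch practice.

```python
import random


def seed_worker_sketch(initial_seed: int) -> int:
    """Derive a worker seed from a base seed and seed Python's RNG with it."""
    # Fold the (possibly 64-bit) base seed into the 32-bit range that
    # many RNG libraries require. This step is an assumption.
    worker_seed = initial_seed % 2**32
    random.seed(worker_seed)
    return worker_seed


# Seeding twice from the same base seed reproduces the same draw.
seed_worker_sketch(2**32 + 7)
first = random.random()
seed_worker_sketch(2**32 + 7)
second = random.random()
```

Because the derived seed depends only on the base seed, rerunning a job with the same base seed should give each worker the same random stream.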

class gliner.training.trainer.TrainingArguments(output_dir=None, overwrite_output_dir=False, do_train=False, do_eval=False, do_predict=False, eval_strategy='no', prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, eval_delay=0, torch_empty_cache_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type='linear', lr_scheduler_kwargs=<factory>, warmup_ratio=0.0, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir=None, logging_strategy='steps', logging_first_step=False, logging_steps=500, logging_nan_inf_filter=True, save_strategy='steps', save_steps=500, save_total_limit=None, save_safetensors=True, save_on_each_node=False, save_only_model=False, restore_callback_states_from_checkpoint=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, bf16=False, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=None, local_rank=-1, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug='', dataloader_drop_last=False, eval_steps=None, dataloader_num_workers=0, dataloader_prefetch_factor=None, past_index=-1, run_name=None, disable_tqdm=None, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=None, fsdp_min_num_params=0, fsdp_config=None, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=None, parallelism_config=None, deepspeed=None, label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=None, project='huggingface', 
trackio_space_id='trackio', ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy='every_save', hub_token=None, hub_private_repo=None, hub_always_push=False, hub_revision=None, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, include_for_metrics=<factory>, eval_do_concat_batches=True, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, include_tokens_per_second=False, include_num_input_tokens_seen=False, neftune_noise_alpha=None, optim_target_modules=None, batch_eval_metrics=False, eval_on_start=False, use_liger_kernel=False, liger_kernel_config=None, eval_use_gather_object=False, average_tokens_across_devices=True, cache_dir=None, others_lr=None, others_weight_decay=0.0, focal_loss_alpha=-1, focal_loss_gamma=0, focal_loss_prob_margin=0, label_smoothing=0, loss_reduction='sum', negatives=1.0, masking='global')[source]

Bases: TrainingArguments

Extended training arguments with custom loss and optimization parameters.

Extends the standard Hugging Face TrainingArguments with additional parameters for focal loss, label smoothing, differential learning rates, and custom negative sampling strategies.

cache_dir

Directory to cache downloaded models and datasets.

Type:

str | None

optim

Optimizer to use. Defaults to “adamw_torch”.

Type:

str

others_lr

Optional separate learning rate for non-encoder parameters (e.g., classification heads). If None, uses the main learning rate.

Type:

float | None

others_weight_decay

Weight decay for non-encoder parameters when using others_lr. Defaults to 0.0.

Type:

float | None

focal_loss_alpha

Alpha parameter for focal loss. Values < 0 disable focal loss weighting. Defaults to -1.

Type:

float | None

focal_loss_gamma

Gamma (focusing parameter) for focal loss. Higher values increase focus on hard examples. Defaults to 0.

Type:

float | None

focal_loss_prob_margin

Probability margin for focal loss computation. Defaults to 0.

Type:

float | None

label_smoothing

Label smoothing factor. 0.0 means no smoothing. Defaults to 0.

Type:

float | None

loss_reduction

Reduction method for loss (‘sum’, ‘mean’, or ‘none’). Defaults to ‘sum’.

Type:

str | None

negatives

Ratio of negative samples to use. Defaults to 1.0.

Type:

float | None

masking

Masking strategy for training (‘global’ or other strategies). Defaults to ‘global’.

Type:

str | None

cache_dir: str | None = None
optim: str = 'adamw_torch'
others_lr: float | None = None
others_weight_decay: float | None = 0.0
focal_loss_alpha: float | None = -1
focal_loss_gamma: float | None = 0
focal_loss_prob_margin: float | None = 0
label_smoothing: float | None = 0
loss_reduction: str | None = 'sum'
negatives: float | None = 1.0
masking: str | None = 'global'
__init__(output_dir=None, overwrite_output_dir=False, do_train=False, do_eval=False, do_predict=False, eval_strategy='no', prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, eval_delay=0, torch_empty_cache_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type='linear', lr_scheduler_kwargs=<factory>, warmup_ratio=0.0, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir=None, logging_strategy='steps', logging_first_step=False, logging_steps=500, logging_nan_inf_filter=True, save_strategy='steps', save_steps=500, save_total_limit=None, save_safetensors=True, save_on_each_node=False, save_only_model=False, restore_callback_states_from_checkpoint=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, bf16=False, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=None, local_rank=-1, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug='', dataloader_drop_last=False, eval_steps=None, dataloader_num_workers=0, dataloader_prefetch_factor=None, past_index=-1, run_name=None, disable_tqdm=None, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=None, fsdp_min_num_params=0, fsdp_config=None, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=None, parallelism_config=None, deepspeed=None, label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=None, project='huggingface', trackio_space_id='trackio', 
ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy='every_save', hub_token=None, hub_private_repo=None, hub_always_push=False, hub_revision=None, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, include_for_metrics=<factory>, eval_do_concat_batches=True, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, include_tokens_per_second=False, include_num_input_tokens_seen=False, neftune_noise_alpha=None, optim_target_modules=None, batch_eval_metrics=False, eval_on_start=False, use_liger_kernel=False, liger_kernel_config=None, eval_use_gather_object=False, average_tokens_across_devices=True, cache_dir=None, others_lr=None, others_weight_decay=0.0, focal_loss_alpha=-1, focal_loss_gamma=0, focal_loss_prob_margin=0, label_smoothing=0, loss_reduction='sum', negatives=1.0, masking='global')
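As a rough illustration of how the custom fields fit together, the dictionary below collects only field names documented above. The values are illustrative assumptions rather than recommended settings, and the `output_dir` and `learning_rate` values in the commented usage are hypothetical.

```python
# Keyword arguments limited to the custom fields documented above.
# All values are illustrative assumptions, not recommended settings.
custom_training_kwargs = {
    "others_lr": 1e-4,            # separate learning rate for non-encoder parameters
    "others_weight_decay": 0.01,  # weight decay for those parameters
    "focal_loss_alpha": 0.75,     # a value < 0 would disable alpha weighting
    "focal_loss_gamma": 2,        # higher values focus more on hard examples
    "label_smoothing": 0.0,       # 0.0 means no smoothing
    "loss_reduction": "sum",      # 'sum', 'mean', or 'none'
    "negatives": 1.0,             # ratio of negative samples
    "masking": "global",          # masking strategy
}

# With gliner installed, these would be combined with the standard arguments:
# from gliner.training.trainer import TrainingArguments
# args = TrainingArguments(output_dir="out", learning_rate=5e-6,
#                          **custom_training_kwargs)
```

Keeping the custom fields in one dictionary makes it easy to see which settings differ from the Hugging Face defaults.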
class gliner.training.trainer.Trainer(model=None, args=None, data_collator=None, train_dataset=None, eval_dataset=None, processing_class=None, model_init=None, compute_loss_func=None, compute_metrics=None, callbacks=None, optimizers=(None, None), optimizer_cls_and_kwargs=None, preprocess_logits_for_metrics=None)[source]

Bases: Trainer

Custom Trainer with enhanced loss functions and error handling.

Extends the Hugging Face Trainer to support:

  • Custom loss functions (focal loss, label smoothing)

  • Differential learning rates for encoder vs. other parameters

  • Robust error handling with automatic recovery from failed batches

  • Custom negative sampling and masking strategies

  • Persistent worker support for data loading

The trainer automatically handles CUDA out-of-memory errors and other exceptions during training by skipping problematic batches and continuing.

training_step(model, inputs, *args, **kwargs)[source]

Perform a training step on a batch of inputs.

Executes forward pass, loss computation, and backward pass for a single training batch. Includes automatic error handling to skip problematic batches without crashing the training run.

Parameters:
  • model – The model to train.

  • inputs – Dictionary of input tensors and targets for the model. The dictionary will be unpacked before being fed to the model. Most models expect targets under the ‘labels’ key.

  • *args – Additional positional arguments (unused, for compatibility).

  • **kwargs – Additional keyword arguments (unused, for compatibility).

Returns:

Training loss tensor for this batch, scaled by gradient accumulation steps. Returns a zero tensor with requires_grad=True if an error occurs.

Return type:

Tensor

Note

If an exception occurs during the training step, the method prints the error, zeroes the gradients, clears the CUDA cache, and returns a zero loss so that training can continue.
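The skip-on-error pattern described in the note can be sketched without any framework. Everything here is hypothetical scaffolding: `guarded_step`, `toy_step`, and the `cleanup` callback are invented names, the callback stands in for zeroing gradients and clearing the CUDA cache, and a plain float replaces the loss tensor. Dividing by the accumulation steps is an assumption about what "scaled" means.

```python
from typing import Any, Callable


def guarded_step(
    step_fn: Callable[[Any], float],
    batch: Any,
    cleanup: Callable[[], None],
    gradient_accumulation_steps: int = 1,
) -> float:
    """Run one step; on failure, clean up and return a zero loss."""
    try:
        loss = step_fn(batch)
        # Assumed scaling: divide so accumulated micro-batches average out.
        return loss / gradient_accumulation_steps
    except Exception as exc:
        print(f"Skipping batch due to error: {exc}")
        cleanup()  # stands in for zeroing gradients and clearing the CUDA cache
        return 0.0


def toy_step(batch):
    """Hypothetical step function that fails on an empty batch."""
    if batch is None:
        raise ValueError("empty batch")
    return float(sum(batch))


cleanups = []
good = guarded_step(toy_step, [1.0, 3.0], lambda: cleanups.append("cleaned"), 2)
bad = guarded_step(toy_step, None, lambda: cleanups.append("cleaned"), 2)
```

A practical consequence of this design is that failed batches contribute nothing to the update, so a run that skips many batches can look healthy while training on less data than expected. Watching the printed errors is worthwhile.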

save_model(output_dir=None, _internal_call=False)[source]

Save the trained model to a directory.

Parameters:
  • output_dir (str | None) – Directory path where the model should be saved. If None, uses the default output directory from training arguments.

  • _internal_call (bool) – Whether this is an internal call from the Trainer. Used for compatibility with the parent class.

compute_loss(model, inputs)[source]

Compute loss using custom loss functions.

Performs forward pass with custom loss parameters including focal loss, label smoothing, and negative sampling configurations from training arguments.

Parameters:
  • model – The model to compute loss for.

  • inputs – Dictionary of input tensors including features and labels.

Returns:

Computed loss tensor.

Note

The loss function parameters (alpha, gamma, label_smoothing, etc.) are passed to the model’s forward method, so the model must support these keyword arguments.
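To make the roles of alpha, gamma, label smoothing, and the reduction mode concrete, here is a minimal sketch of the standard binary (sigmoid) focal loss in plain Python. It is not taken from the model's forward method: `binary_focal_loss` is a hypothetical name, the smoothing formula is an assumption, `focal_loss_prob_margin` is left out, and the arithmetic is numerically naive for extreme logits.

```python
import math


def binary_focal_loss(logits, targets, alpha=-1.0, gamma=0.0,
                      label_smoothing=0.0, reduction="sum"):
    """Sigmoid focal loss over parallel lists of logits and 0/1 targets."""
    losses = []
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        # Assumed smoothing: pull hard targets toward 0.5.
        y_s = y * (1.0 - label_smoothing) + 0.5 * label_smoothing
        ce = -(y_s * math.log(p) + (1.0 - y_s) * math.log(1.0 - p))
        # p_t is the probability assigned to the true class.
        p_t = p * y + (1.0 - p) * (1.0 - y)
        # gamma > 0 down-weights examples the model already gets right.
        loss = ce * (1.0 - p_t) ** gamma
        if alpha >= 0:
            # Negative alpha disables class weighting entirely.
            loss *= alpha * y + (1.0 - alpha) * (1.0 - y)
        losses.append(loss)
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    return losses  # reduction == "none"


plain = binary_focal_loss([0.0], [1])               # defaults: ordinary cross-entropy
focused = binary_focal_loss([0.0], [1], gamma=2.0)  # modulated by (1 - p_t) ** 2
```

With the documented defaults (alpha of -1, gamma of 0, no smoothing) the expression reduces to ordinary binary cross-entropy, which is why those defaults effectively switch the focal terms off.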

create_optimizer()[source]

Create and configure the optimizer with parameter groups.

Sets up the optimizer with support for:

  • Separate learning rates for encoder and non-encoder parameters

  • Weight decay only for non-bias and non-LayerNorm parameters

  • Custom weight decay values for different parameter groups

Returns:

Configured optimizer instance.

Note

If self.args.others_lr is set, creates four parameter groups:

  1. Non-encoder parameters with weight decay

  2. Non-encoder parameters without weight decay

  3. Encoder parameters with weight decay

  4. Encoder parameters without weight decay

Otherwise, creates two standard parameter groups with and without weight decay.
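The grouping described in this note can be sketched on parameter names alone. The helper below is hypothetical: `build_param_groups` is an invented name, the `is_encoder` predicate (a name prefix of `encoder.`) is an assumption about how encoder parameters are recognised, and attaching an explicit `lr` to the encoder groups is done only for clarity.

```python
def build_param_groups(param_names, learning_rate, weight_decay,
                       others_lr=None, others_weight_decay=0.0,
                       is_encoder=lambda name: name.startswith("encoder.")):
    """Return optimizer-style groups, holding parameter names instead of tensors."""
    def decays(name):
        # Biases and LayerNorm weights are excluded from weight decay.
        return "bias" not in name and "LayerNorm" not in name

    if others_lr is None:
        # Standard case: one group with decay, one without.
        return [
            {"params": [n for n in param_names if decays(n)],
             "weight_decay": weight_decay},
            {"params": [n for n in param_names if not decays(n)],
             "weight_decay": 0.0},
        ]

    def pick(encoder, decay):
        return [n for n in param_names
                if is_encoder(n) == encoder and decays(n) == decay]

    return [
        {"params": pick(False, True), "weight_decay": others_weight_decay, "lr": others_lr},
        {"params": pick(False, False), "weight_decay": 0.0, "lr": others_lr},
        {"params": pick(True, True), "weight_decay": weight_decay, "lr": learning_rate},
        {"params": pick(True, False), "weight_decay": 0.0, "lr": learning_rate},
    ]


names = ["encoder.layer.weight", "encoder.layer.bias",
         "encoder.LayerNorm.weight", "head.weight", "head.bias"]
groups = build_param_groups(names, learning_rate=5e-6, weight_decay=0.01,
                            others_lr=1e-4, others_weight_decay=0.0)
```

A common motivation for this split is that a pretrained encoder usually tolerates only a small learning rate, while freshly initialised heads can learn faster.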

prediction_step(model, inputs, prediction_loss_only, ignore_keys=None)[source]

Perform an evaluation step on the model using inputs.

Executes a single forward pass for evaluation without computing gradients.

Parameters:
  • model (Module) – The model to evaluate.

  • inputs (Dict[str, Tensor | Any]) – Dictionary of input tensors and targets for the model. The dictionary will be unpacked before being fed to the model. Most models expect targets under the ‘labels’ key.

  • prediction_loss_only (bool) – If True, only returns the loss and ignores logits and labels.

  • ignore_keys (List[str] | None) – Optional list of keys in the model output dictionary that should be ignored when gathering predictions. Currently unused.

Returns:

A tuple of (loss, logits, labels):

  • loss: Loss tensor if computed, None otherwise

  • logits: Model predictions if prediction_loss_only is False, None otherwise

  • labels: Ground truth labels if prediction_loss_only is False, None otherwise

Return type:

tuple

get_train_dataloader()[source]

Create and return the training DataLoader.

Constructs a DataLoader with appropriate sampler, collation function, and worker configuration for the training dataset. Includes seeded worker initialization for reproducibility.

Returns:

Configured and accelerator-prepared training DataLoader.

Raises:

ValueError – If train_dataset is None.

Return type:

DataLoader

Note

For an IterableDataset, sampler and drop_last are not set. For regular datasets, the method uses the sampler from _get_train_sampler() and applies worker seeding via the seed_worker function.
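The branching in this note amounts to building a different keyword set depending on the dataset kind. The sketch below is hypothetical: `train_dataloader_kwargs` and the `is_iterable` flag are invented for illustration, and the keys simply mirror common `torch.utils.data.DataLoader` argument names.

```python
def train_dataloader_kwargs(batch_size, collate_fn, num_workers, is_iterable,
                            sampler=None, drop_last=False, worker_init_fn=None):
    """Assemble DataLoader keyword arguments for a training dataset."""
    kwargs = {
        "batch_size": batch_size,
        "collate_fn": collate_fn,
        "num_workers": num_workers,
    }
    if not is_iterable:
        # Only map-style datasets get a sampler, drop_last, and worker seeding.
        kwargs["sampler"] = sampler
        kwargs["drop_last"] = drop_last
        kwargs["worker_init_fn"] = worker_init_fn
    return kwargs


streaming = train_dataloader_kwargs(8, list, 2, is_iterable=True)
map_style = train_dataloader_kwargs(8, list, 2, is_iterable=False,
                                    sampler="random", drop_last=True)
```

An iterable dataset defines its own iteration order, which is why a sampler has no role in that branch.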

get_eval_dataloader(eval_dataset=None)[source]

Create and return the evaluation DataLoader.

Constructs a DataLoader for evaluation with support for persistent workers and multiple evaluation datasets. Caches DataLoaders when persistent workers are enabled to avoid recreation overhead.

Parameters:

eval_dataset (str | Dataset | None) – Evaluation dataset to use. Can be:

  • None: Uses self.eval_dataset

  • str: Uses self.eval_dataset[eval_dataset] (for named eval sets)

  • Dataset: Overrides self.eval_dataset directly

Returns:

Configured and accelerator-prepared evaluation DataLoader.

Raises:

ValueError – If both eval_dataset and self.eval_dataset are None.

Return type:

DataLoader

Note

When persistent_workers is True, DataLoaders are cached in self._eval_dataloaders to avoid worker process recreation between evaluation calls. The cache key is the dataset name (if string) or “eval” for the default dataset.
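The caching rule in this note can be illustrated with a small stand-alone class. This is a sketch under assumptions, not the trainer's code: `EvalLoaderCache` and `fake_build` are hypothetical names, and a plain object stands in for the DataLoader.

```python
class EvalLoaderCache:
    """Cache evaluation loaders by name when persistent workers are enabled."""

    def __init__(self, build_loader, persistent_workers):
        self._build_loader = build_loader
        self._persistent_workers = persistent_workers
        self._eval_dataloaders = {}

    def get(self, eval_dataset=None):
        # The key is the dataset name if a string was given, otherwise "eval".
        key = eval_dataset if isinstance(eval_dataset, str) else "eval"
        if self._persistent_workers and key in self._eval_dataloaders:
            return self._eval_dataloaders[key]
        loader = self._build_loader(key)
        if self._persistent_workers:
            self._eval_dataloaders[key] = loader
        return loader


built = []


def fake_build(key):
    """Hypothetical builder that records each construction."""
    built.append(key)
    return object()


cache = EvalLoaderCache(fake_build, persistent_workers=True)
a = cache.get()
b = cache.get()              # served from the cache, no rebuild
c = cache.get("validation")  # a named set gets its own entry
```

Without persistent workers the cache is bypassed, so each call builds a fresh loader.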