Distributional Properties of Subword Regularization
- URL: http://arxiv.org/abs/2408.11443v1
- Date: Wed, 21 Aug 2024 08:53:35 GMT
- Title: Distributional Properties of Subword Regularization
- Authors: Marco Cognetta, Vilém Zouhar, Naoaki Okazaki
- Abstract summary: BPE and MaxMatch, two popular subword tokenization schemes, have dropout regularization variants.
We show that these variants are heavily biased towards a small set of tokenizations per word.
We propose an algorithm to uniformly sample tokenizations, which we use as a drop-in replacement for the stochastic aspects of existing tokenizers.
- Score: 25.824110425757198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
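As a rough illustration of the proposed replacement for dropout-style sampling, the sketch below counts all segmentations of a word over a fixed subword vocabulary with dynamic programming and then samples one segmentation uniformly at random. The vocabulary, word, and function names are illustrative assumptions; this is a minimal sketch, not the authors' implementation.

```python
import random

def count_segmentations(word: str, vocab: set[str]) -> list[int]:
    """counts[i] = number of ways to tokenize word[i:] using only subwords from vocab."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) tokenization
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                counts[i] += counts[j]
    return counts

def sample_uniform_tokenization(word: str, vocab: set[str]) -> list[str]:
    """Sample one tokenization of `word` uniformly from all tokenizations over `vocab`."""
    counts = count_segmentations(word, vocab)
    if counts[0] == 0:
        raise ValueError("word cannot be tokenized with this vocabulary")
    tokens, i, n = [], 0, len(word)
    while i < n:
        # Pick the next subword with probability proportional to the number of
        # completions it leaves; this makes every full tokenization equally likely.
        candidates = [(j, counts[j]) for j in range(i + 1, n + 1) if word[i:j] in vocab]
        r = random.randrange(sum(c for _, c in candidates))
        for j, c in candidates:
            if r < c:
                tokens.append(word[i:j])
                i = j
                break
            r -= c
    return tokens

# Toy vocabulary; real subword vocabularies come from BPE/unigram training.
vocab = {"u", "n", "un", "re", "lated", "related", "relate", "d", "unrelated"}
print(sample_uniform_tokenization("unrelated", vocab))
```

Because each next subword is chosen with probability proportional to the number of completions it admits, every full tokenization receives probability exactly 1 / counts[0]; dropout-style samplers such as BPE-dropout instead concentrate probability on a small set of tokenizations per word, which is the bias the paper documents.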
Related papers
- Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning [35.41241409574854]
We show that inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias.
arXiv Detail & Related papers (2025-12-28T21:44:07Z)
- Efficient semantic uncertainty quantification in language models via diversity-steered sampling [46.23327887393273]
We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation.
arXiv Detail & Related papers (2025-10-24T10:06:21Z)
- A Unified Probabilistic Framework for Dictionary Learning with Parsimonious Activation [10.775460285501739]
We introduce a parsimony-promoting regularizer based on the row-wise $L_\infty$ norm of the coefficient matrix. This additional penalty encourages entire rows of the coefficient matrix to vanish, thereby reducing the number of dictionary atoms activated across the dataset; a toy computation of such a penalty appears below.
arXiv Detail & Related papers (2025-09-30T02:46:11Z)
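As a hedged illustration of the row-wise $L_\infty$ penalty summarized in the entry above (a sketch under assumed shapes and weights, not the paper's exact objective), the snippet below sums the maximum absolute value of each row of a coefficient matrix; adding this term to a data-fit loss pushes entire rows, and hence entire dictionary atoms, toward zero.

```python
import numpy as np

def row_linf_penalty(coeffs: np.ndarray) -> float:
    """Sum over rows of the L-infinity norm: sum_i max_j |C[i, j]|.
    Each row collects one dictionary atom's activations across the dataset,
    so driving a whole row to zero deactivates that atom everywhere."""
    return float(np.abs(coeffs).max(axis=1).sum())

# Hypothetical regularized objective with an assumed weight `lam`.
rng = np.random.default_rng(0)
C = rng.normal(size=(8, 100))   # 8 atoms x 100 samples (illustrative shapes)
lam = 0.1
data_fit_loss = 0.0             # placeholder for the reconstruction term
objective = data_fit_loss + lam * row_linf_penalty(C)
print(objective)
```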
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%; a toy illustration of this grouping appears below.
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
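The right-aligned digit grouping mentioned in the entry above can be illustrated with a toy re-tokenization of digit strings: group digits into chunks of three starting from the least-significant end. This is only an illustrative sketch, not the tokenizer evaluated in the paper.

```python
def right_aligned_digit_groups(number: str, group: int = 3) -> list[str]:
    """Split a digit string into groups of `group` digits aligned to the right,
    so any partial group sits at the front and place values line up across numbers."""
    head = len(number) % group
    chunks = [number[:head]] if head else []
    chunks += [number[i:i + group] for i in range(head, len(number), group)]
    return chunks

# A left-aligned greedy split would give ['123', '456', '7'];
# right-aligned grouping keeps the low-order groups consistent instead:
print(right_aligned_digit_groups("1234567"))  # ['1', '234', '567']
```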
- Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning [0.0]
Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes. Our framework introduces two innovations: first, Tokenadapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization for multi-word Supertokens.
arXiv Detail & Related papers (2025-05-14T19:00:27Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating constraints on every token can be prohibitively expensive.
Locally constrained decoding (LCD) can distort the global distribution over strings by sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and can even incur heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Obtaining Explainable Classification Models using Distributionally Robust Optimization [12.511155426574563]
We study generalized linear models constructed using sets of feature value rules.
An inherent trade-off exists between rule set sparsity and its prediction accuracy.
We propose a new formulation to learn an ensemble of rule sets that simultaneously addresses these competing factors.
arXiv Detail & Related papers (2023-11-03T15:45:34Z)
- Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm [45.42075576656938]
Contextual biasing refers to the problem of biasing automatic speech recognition systems towards rare entities.
We propose algorithms for contextual biasing based on the Knuth-Morris-Pratt algorithm for pattern matching.
arXiv Detail & Related papers (2023-09-29T22:50:10Z)
- Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling.
Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective.
This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z)
- Alignment Entropy Regularization [13.904347165738491]
We use entropy to measure a model's uncertainty.
We evaluate the effect of entropy regularization in encouraging the model to distribute the probability mass only on a smaller subset of allowed alignments; a toy computation of this penalty appears below.
arXiv Detail & Related papers (2022-12-22T18:51:02Z)
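A minimal sketch of an entropy penalty over a distribution of alignments, under assumed scores and weights rather than the paper's actual training objective: take the Shannon entropy of a softmax over alignment scores and add it to the loss, so that minimizing it concentrates probability mass on fewer alignments.

```python
import numpy as np

def alignment_entropy(scores: np.ndarray) -> float:
    """Shannon entropy (in nats) of the alignment distribution induced by a softmax
    over unnormalized alignment scores."""
    z = scores - scores.max()              # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical usage: penalize spread-out alignment posteriors during training.
alignment_scores = np.array([2.0, 1.5, 0.1, -3.0])  # scores for 4 candidate alignments
beta = 0.05                                         # regularization weight (assumed)
task_loss = 0.0                                     # placeholder for the main loss
loss = task_loss + beta * alignment_entropy(alignment_scores)
print(loss)
```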
- A Sparsity-promoting Dictionary Model for Variational Autoencoders [16.61511959679188]
Structuring the latent space in deep generative models is important to yield more expressive models and interpretable representations.
We propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model.
arXiv Detail & Related papers (2022-03-29T17:13:11Z)
- Single Model Ensemble for Subword Regularized Models in Low-Resource Machine Translation [25.04086897886412]
Subword regularization uses multiple subword segmentations during training to improve the robustness of neural machine translation models, but inference typically uses only a single segmentation.
We propose an inference strategy to address this discrepancy.
Experimental results show that the proposed strategy improves the performance of models trained with subword regularization.
arXiv Detail & Related papers (2022-03-25T09:25:47Z)
- Causally-motivated Shortcut Removal Using Auxiliary Labels [63.686580185674195]
A key challenge in learning such risk-invariant predictors is shortcut learning.
We propose a flexible, causally-motivated approach to address this challenge.
We show both theoretically and empirically that this causally-motivated regularization scheme yields robust predictors.
arXiv Detail & Related papers (2021-05-13T16:58:45Z)
- Squared $\ell_2$ Norm as Consistency Loss for Leveraging Augmented Data to Learn Robust and Invariant Representations [76.85274970052762]
Regularizing distance between embeddings/representations of original samples and augmented counterparts is a popular technique for improving robustness of neural networks.
In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings.
We show that the generic approach we identified (squared $\ell_2$ regularized augmentation) outperforms several recent methods, each of which is specially designed for one task; a minimal sketch of this consistency term appears below.
arXiv Detail & Related papers (2020-11-25T22:40:09Z)
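A small sketch of the generic consistency term summarized above, using stand-in embeddings from a hypothetical encoder f(.) and an assumed weight; the paper's actual encoder, augmentations, and weighting may differ.

```python
import numpy as np

def squared_l2_consistency(emb_clean: np.ndarray, emb_aug: np.ndarray) -> float:
    """Squared Euclidean distance between the embeddings of an example and its augmentation."""
    return float(np.sum((emb_clean - emb_aug) ** 2))

# Stand-in embeddings from a hypothetical encoder f(.); shapes and values are illustrative.
emb_clean = np.array([0.2, -1.0, 0.7])   # f(x)
emb_aug = np.array([0.1, -0.8, 0.9])     # f(augment(x))
lam = 1.0                                # consistency weight (assumed)
task_loss = 0.0                          # placeholder for the supervised loss
total_loss = task_loss + lam * squared_l2_consistency(emb_clean, emb_aug)
print(total_loss)
```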
- Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.