Two Counterexamples to Tokenization and the Noiseless Channel
- URL: http://arxiv.org/abs/2402.14614v2
- Date: Thu, 29 Feb 2024 09:20:37 GMT
- Title: Two Counterexamples to Tokenization and the Noiseless Channel
- Authors: Marco Cognetta and Vilém Zouhar and Sangwhan Moon and Naoaki Okazaki
- Abstract summary: In Tokenization and the Noiseless Channel, Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer.
Although useful, the predictive power of this metric is not perfect, and the authors note there are additional qualities of a good tokenization scheme that Rényi efficiency alone cannot capture.
We describe two variants of BPE tokenization which can arbitrarily increase Rényi efficiency while decreasing the downstream model performance.
- Score: 24.127593302335164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi
efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer:
for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of
the unigram distribution should be chosen. The Rényi efficiency is thus
treated as a predictor of downstream performance (e.g., predicting BLEU for a
machine translation task), without the expensive step of training multiple
models with different tokenizers. Although useful, the predictive power of this
metric is not perfect, and the authors note there are additional qualities of a
good tokenization scheme that Rényi efficiency alone cannot capture.
We describe two variants of BPE tokenization which can arbitrarily increase
Rényi efficiency while decreasing the downstream model performance. These
counterexamples expose cases where Rényi efficiency fails as an intrinsic
tokenization metric and thus give insight for building more accurate
predictors.
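
For concreteness, the quantity being evaluated can be computed directly from a tokenized corpus. Below is a minimal Python sketch, not taken from either paper, of the Rényi efficiency of a tokenizer's unigram distribution: the Rényi entropy of order alpha of the token frequencies, normalized by the log of the vocabulary size. The helper name `renyi_efficiency`, the toy corpus, and the default alpha = 2.5 (the order reported by Zouhar et al. to correlate well with BLEU) are illustrative assumptions.

```python
# Minimal sketch (an assumption, not the papers' code): Rényi efficiency of a
# tokenizer's unigram distribution, i.e. the Rényi entropy of the token
# frequencies normalized by the log of the vocabulary size.
import math
from collections import Counter

def renyi_efficiency(tokens, vocab_size, alpha=2.5):
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if abs(alpha - 1.0) < 1e-9:
        # alpha -> 1 recovers the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)  # normalized to [0, 1]

# Toy example: 6 tokens drawn from a hypothetical 8-type vocabulary.
print(renyi_efficiency(["the", "cat", "sat", "on", "the", "mat"], vocab_size=8))  # ≈ 0.71
```

In these terms, the counterexamples in the paper are tokenizers that drive this number up while producing segmentations on which downstream models perform worse.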
Related papers
- Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance. This paper addresses the question of how to optimally combine the model's predictions and the provided labels. Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z) - ZipR1: Reinforcing Token Sparsity in MLLMs [25.92720050123066]
We propose a simple RL-based post-training method named ZipR1 that treats the token reduction ratio as the efficiency reward and answer accuracy as the performance reward.
Experimental results demonstrate that ZipR1 can reduce the token ratio of Qwen2/2.5-VL from 80% to 25% with a minimal accuracy reduction on 13 image and video benchmarks.
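
As a rough illustration of the reward design described above (a hedged sketch under assumptions, not the authors' implementation), the efficiency reward and the performance reward could be combined as follows; the function name `zipr1_style_reward` and the weighting `lam` are hypothetical.

```python
# Hypothetical sketch of a ZipR1-style combined reward: answer correctness
# plus a bonus for dropping tokens. The weighting and exact form are assumed.
def zipr1_style_reward(correct, tokens_kept, tokens_total, lam=0.5):
    performance = 1.0 if correct else 0.0            # accuracy-based reward
    efficiency = 1.0 - tokens_kept / tokens_total    # higher when more tokens are pruned
    return performance + lam * efficiency

# Example: a correct answer that keeps 25% of the input tokens.
print(zipr1_style_reward(True, 25, 100))  # 1.375
```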
arXiv Detail & Related papers (2025-04-23T01:45:55Z) - Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z) - Loop Neural Networks for Parameter Sharing [1.1049608786515839]
We introduce a novel Loop Neural Network, which achieves better performance by utilizing longer computational time without increasing the model size.
Our approach revisits the input multiple times, refining the prediction by iteratively looping over a subset of the model with residual connections.
We demonstrate the effectiveness of this method through experiments comparing versions of GPT-2 with our loop models, showing improved performance in language modeling tasks while maintaining similar parameter counts.
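
The looping mechanism summarized above can be sketched in a few lines; the block below is an assumed simplification (the names `LoopBlock` and `n_loops` are hypothetical, and it omits attention), showing parameter reuse with a residual update on each pass rather than the paper's exact architecture.

```python
# Hypothetical sketch: one feed-forward block applied repeatedly with
# residual connections, so depth grows with compute but not with parameters.
import torch
import torch.nn as nn

class LoopBlock(nn.Module):
    def __init__(self, d_model, n_loops=4):
        super().__init__()
        self.n_loops = n_loops
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h):
        for _ in range(self.n_loops):       # reuse the same weights each pass
            h = h + self.ff(self.norm(h))   # residual refinement
        return h

x = torch.randn(2, 16, 64)                  # (batch, sequence, d_model)
print(LoopBlock(64)(x).shape)               # torch.Size([2, 16, 64])
```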
arXiv Detail & Related papers (2024-09-21T17:07:42Z) - Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z) - Variance-Reducing Couplings for Random Features [57.73648780299374]
Random features (RFs) are a popular technique to scale up kernel methods in machine learning.
We find couplings to improve RFs defined on both Euclidean and discrete input spaces.
We reach surprising conclusions about the benefits and limitations of variance reduction as a paradigm.
arXiv Detail & Related papers (2024-05-26T12:25:09Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Target Variable Engineering [0.0]
We compare the predictive performance of regression models trained to predict numeric targets vs. classifiers trained to predict their binarized counterparts.
We find that regression requires significantly more computational effort to converge upon the optimal performance.
arXiv Detail & Related papers (2023-10-13T23:12:21Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only a 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z) - Pairwise Supervised Hashing with Bernoulli Variational Auto-Encoder and Self-Control Gradient Estimator [62.26981903551382]
Variational auto-encoders (VAEs) with binary latent variables provide state-of-the-art performance in terms of precision for document retrieval.
We propose a pairwise loss function with discrete latent VAE to reward within-class similarity and between-class dissimilarity for supervised hashing.
This new semantic hashing framework achieves superior performance compared to the state of the art.
arXiv Detail & Related papers (2020-05-21T06:11:33Z) - An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction [84.49035467829819]
We show that it is possible to better manage this trade-off by optimizing a bound on the Information Bottleneck (IB) objective.
Our fully unsupervised approach jointly learns an explainer that predicts sparse binary masks over sentences, and an end-task predictor that considers only the extracted rationale.
arXiv Detail & Related papers (2020-05-01T23:26:41Z)