MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End
Language Modeling
- URL: http://arxiv.org/abs/2212.07284v1
- Date: Wed, 14 Dec 2022 15:33:44 GMT
- Title: MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End
Language Modeling
- Authors: Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot
- Abstract summary: We propose MANTa, a Module for Adaptive Neural TokenizAtion.
MANTa is a differentiable tokenizer trained end-to-end with the language model.
We show that MANTa performs comparably to other models on the general-domain GLUE benchmark.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Static subword tokenization algorithms have been an essential component of
recent works on language modeling. However, their static nature results in
significant weaknesses that degrade the models' downstream performance and robustness.
In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion.
MANTa is a differentiable tokenizer trained end-to-end with the language model.
The resulting system offers a trade-off between the expressiveness of
byte-level models and the speed of models trained using subword tokenization.
In addition, our tokenizer is highly explainable since it produces an explicit
segmentation of sequences into blocks. We evaluate our pre-trained model on
several English datasets from different domains as well as on synthetic noise.
We find that MANTa improves robustness to character perturbations and
out-of-domain data. We then show that MANTa performs comparably to other models
on the general-domain GLUE benchmark. Finally, we show that it is considerably
faster than strictly byte-level models.
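To make the idea of a differentiable tokenizer more concrete, below is a minimal sketch (not the authors' implementation) of a byte-to-block pooling module in the spirit of MANTa: a small network predicts a block-frontier probability for every byte, and byte embeddings are pooled into a shorter block sequence with soft, fully differentiable weights. The class name, pooling kernel, and hyperparameters are illustrative assumptions.

    # Minimal sketch of soft byte-to-block pooling, under assumptions; not the paper's exact module.
    import torch
    import torch.nn as nn

    class SoftBlockPooler(nn.Module):
        def __init__(self, d_model=64, n_blocks=32, vocab_size=256):
            super().__init__()
            self.byte_emb = nn.Embedding(vocab_size, d_model)   # one embedding per byte value
            self.frontier = nn.Linear(d_model, 1)               # predicts P(byte starts a new block)
            self.n_blocks = n_blocks

        def forward(self, byte_ids):                            # byte_ids: (batch, seq_len)
            x = self.byte_emb(byte_ids)                         # (batch, seq_len, d_model)
            p_frontier = torch.sigmoid(self.frontier(x)).squeeze(-1)   # (batch, seq_len)
            # Expected block index of each byte = cumulative sum of frontier probabilities.
            block_pos = torch.cumsum(p_frontier, dim=-1)        # monotone over the sequence
            # Soft assignment of each byte to each of n_blocks output slots,
            # peaked around its expected block index (a Gaussian-like kernel).
            centers = torch.arange(self.n_blocks, device=byte_ids.device).float()
            dist = block_pos.unsqueeze(-1) - centers            # (batch, seq_len, n_blocks)
            weights = torch.softmax(-dist.pow(2), dim=1)        # normalize over bytes
            blocks = torch.einsum("bsd,bsk->bkd", x, weights)   # (batch, n_blocks, d_model)
            return blocks                                       # fed to the language model encoder

    # The Transformer then runs on the short block sequence instead of the raw byte
    # sequence, which is where the speed/expressiveness trade-off comes from.
    pooled = SoftBlockPooler()(torch.randint(0, 256, (2, 128)))   # -> shape (2, 32, 64)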
Related papers
- Relation-based Counterfactual Data Augmentation and Contrastive Learning for Robustifying Natural Language Inference Models [0.0]
We propose a method that uses token-based and sentence-based augmentation to generate counterfactual sentence pairs.
We show that the proposed method can improve the performance and robustness of the NLI model.
arXiv Detail & Related papers (2024-10-28T03:43:25Z)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs).
This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.
We develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
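The next-byte idea in the entry above can be illustrated with a heavily simplified toy (not the paper's exact algorithm): marginalize the model's next-token distribution over all tokens that begin with a given byte. The vocabulary and probabilities below are assumed toy data.

    # Toy illustration of byte-level marginalization over a tokenized LM's next-token distribution.
    from collections import defaultdict

    def next_byte_distribution(vocab, token_probs):
        """vocab: list of token strings (UTF-8); token_probs: matching probabilities."""
        byte_probs = defaultdict(float)
        for token, p in zip(vocab, token_probs):
            first_byte = token.encode("utf-8")[0]   # the byte this token would emit next
            byte_probs[first_byte] += p             # sum over all tokens sharing that first byte
        return dict(byte_probs)

    vocab = ["the", "then", "a"]
    probs = [0.5, 0.3, 0.2]
    print(next_byte_distribution(vocab, probs))     # {ord('t'): 0.8, ord('a'): 0.2}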
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method that further improves performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
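As a purely illustrative sketch of the entry above, one way a divergent-chains-of-thought training example could be assembled is to concatenate several reasoning chains for the same question before a single final answer; the template below is an assumption, not the paper's data format.

    # Hypothetical builder for a DCoT-style instruction-tuning example (assumed template).
    def build_dcot_example(question, chains, answer):
        prompt = (f"Question: {question}\n"
                  f"Provide {len(chains)} different reasoning chains, then answer.\n")
        target = ""
        for i, chain in enumerate(chains, start=1):
            target += f"Chain {i}: {chain}\n"
        target += f"Final answer: {answer}"
        return {"prompt": prompt, "target": target}

    example = build_dcot_example(
        "What is 17 + 26?",
        ["17 + 26 = 17 + 20 + 6 = 43.", "26 + 17 = 26 + 10 + 7 = 43."],
        "43",
    )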
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali [0.0]
Studies of how subword tokenization affects the understanding capacity of language models are scarce and limited to a handful of languages.
We used six different tokenization schemes to pretrain relatively small language models in Nepali and fine-tuned the learned representations on several downstream tasks.
arXiv Detail & Related papers (2024-04-28T05:26:12Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- What is the best recipe for character-level encoder-only modelling? [2.792030485253753]
This paper aims to benchmark recent progress in language understanding models that output contextualised representations at the character level.
We find that our best performing character-level model exceeds the performance of a token-based model trained with the same settings on the same data.
We believe our results demonstrate the readiness of character-level models for multilingual language representation, and encourage NLP practitioners to try them as drop-in replacements for token-based models.
arXiv Detail & Related papers (2023-05-09T14:00:15Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
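A condensed sketch of the GBST idea from the Charformer entry above, under simplifying assumptions (the published module's offsets, downsampling, and score calibration are omitted): mean-pool character embeddings over several candidate block sizes, score each candidate at every position, and mix the candidates with a softmax so the choice of block size stays differentiable.

    # Condensed GBST-style sketch; class name and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyGBST(nn.Module):
        def __init__(self, d_model=64, max_block=4, vocab_size=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, d_model)
            self.scorer = nn.Linear(d_model, 1)       # one scalar score per candidate block
            self.max_block = max_block

        def forward(self, char_ids):                  # (batch, seq_len)
            x = self.emb(char_ids)                    # (batch, seq_len, d)
            candidates, scores = [], []
            for b in range(1, self.max_block + 1):
                # Mean-pool over non-overlapping blocks of size b, then upsample
                # back to the original length so candidates align per position.
                pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=b, stride=b, ceil_mode=True)
                upsampled = torch.repeat_interleave(pooled, b, dim=-1)[..., : x.size(1)]
                cand = upsampled.transpose(1, 2)      # (batch, seq_len, d)
                candidates.append(cand)
                scores.append(self.scorer(cand))      # (batch, seq_len, 1)
            cands = torch.stack(candidates, dim=2)    # (batch, seq_len, max_block, d)
            weights = torch.softmax(torch.cat(scores, dim=-1), dim=-1)  # (batch, seq_len, max_block)
            return (weights.unsqueeze(-1) * cands).sum(dim=2)           # (batch, seq_len, d)

    out = TinyGBST()(torch.randint(0, 256, (2, 16)))  # -> shape (2, 16, 64)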
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
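The entry above only names Dynamic Blocking; as a schematic sketch of the surface-form blocking idea (the published algorithm adds probabilistic blocking and further machinery, omitted here), one can ban the source token that immediately follows any token the decoder has just copied, nudging the model away from verbatim reproduction.

    # Schematic sketch of the blocking step only; helper name and inputs are illustrative.
    def blocked_next_tokens(source_tokens, last_generated_token):
        banned = set()
        for i, tok in enumerate(source_tokens[:-1]):
            if tok == last_generated_token:
                banned.add(source_tokens[i + 1])      # block the source continuation
        return banned

    source = ["the", "quick", "brown", "fox"]
    print(blocked_next_tokens(source, "quick"))       # {'brown'} -> decoder must rephrase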
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
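A minimal sketch, under assumptions, of what a fully compositional output embedding layer can look like: each candidate word's output vector is built on the fly from its characters, so the parameter count does not grow with the word vocabulary. Any additional grounding used in the paper is omitted here; the class and method names are illustrative.

    # Illustrative compositional output layer: output vectors composed from characters.
    import torch
    import torch.nn as nn

    class CompositionalOutput(nn.Module):
        def __init__(self, d_model=64, n_chars=128):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, d_model)
            self.compose = nn.GRU(d_model, d_model, batch_first=True)

        def word_vector(self, word):
            ids = torch.tensor([[min(ord(c), 127) for c in word]])
            _, h = self.compose(self.char_emb(ids))   # final GRU state summarizes the word
            return h.squeeze(0).squeeze(0)            # (d_model,)

        def logits(self, hidden, candidate_words):
            # Score any candidate vocabulary, even words unseen during training.
            vecs = torch.stack([self.word_vector(w) for w in candidate_words])  # (V, d)
            return hidden @ vecs.T                    # (V,) scores for the next word

    layer = CompositionalOutput()
    scores = layer.logits(torch.randn(64), ["cat", "cats", "catapult"])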