From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
- URL: http://arxiv.org/abs/2506.14761v1
- Date: Tue, 17 Jun 2025 17:55:11 GMT
- Title: From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
- Authors: Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz
- Abstract summary: Tokenization imposes a fixed granularity on the input text. We introduce an autoregressive U-Net that learns to embed its own tokens as it trains.
- Score: 49.16552366851748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
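The pooling hierarchy described in the abstract (raw bytes pooled into words, then pairs of words, then groups of up to four words) can be pictured with a minimal sketch. The whitespace-based boundary rule, the `word_starts` helper, and the use of boundary-position selection as the pooling operator are illustrative assumptions, not the paper's actual mechanism.

```python
# Minimal sketch of the multi-scale pooling idea, assuming whitespace marks word
# boundaries and that "pooling" simply selects the hidden state at each boundary.
import torch

def word_starts(byte_ids: torch.Tensor, space: int = 32) -> torch.Tensor:
    """Indices where a new word begins (assumption: words split on spaces)."""
    starts = torch.ones(len(byte_ids), dtype=torch.bool)
    starts[1:] = byte_ids[:-1] == space       # a byte right after a space opens a word
    return starts.nonzero(as_tuple=True)[0]

text = b"the network reads raw bytes and pools them"
byte_ids = torch.tensor(list(text))
byte_states = torch.randn(len(byte_ids), 16)  # stand-in for stage-1 byte embeddings

w = word_starts(byte_ids)
word_states = byte_states[w]                  # stage 2: one vector per word
pair_states = word_states[::2]                # stage 3: one vector per 2 words
quad_states = word_states[::4]                # stage 4: one vector per 4 words

print(len(byte_ids), word_states.shape[0], pair_states.shape[0], quad_states.shape[0])
# 42 bytes -> 8 words -> 4 pairs -> 2 four-word groups
```

Deeper stages then operate on the coarser sequences, which is why they predict several words ahead rather than the next byte.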
Related papers
- Improving Large Language Models with Concept-Aware Fine-Tuning [55.59287380665864]
Concept-Aware Fine-Tuning (CAFT) is a novel multi-token training method for large language models (LLMs). CAFT enables the learning of sequences that span multiple tokens, fostering stronger concept-aware learning. Experiments demonstrate significant improvements compared to conventional next-token finetuning methods.
arXiv Detail & Related papers (2025-06-09T14:55:00Z)
- Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher-level idea or action in a flow. We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z)
- Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks [12.7259425362286]
We investigate how multilingual models might leverage key-value memories.
For autoregressive models trained on two or more languages, do all neurons (across layers) respond equally to all languages?
Our findings reveal that the layers closest to the network's input or output tend to exhibit more language-specific behaviour compared to the layers in the middle.
arXiv Detail & Related papers (2023-10-24T06:45:00Z)
- Word-Level Representation From Bytes For Language Modeling [46.28198397863388]
Sub-word tokenization is not robust to noise and is difficult to generalize to new languages.
We introduce a cross-attention network that builds word-level representation directly from bytes, and a sub-word level prediction based on word-level hidden states.
Byte2Word is on par with the strong sub-word baseline BERT while taking up only 10% of the embedding size.
arXiv Detail & Related papers (2022-11-23T03:11:13Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is not, and likely never will be, a silver-bullet singular solution for all applications.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Dict-BERT: Enhancing Language Model Pre-training with Dictionary [42.0998323292348]
Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora.
In this work, we focus on enhancing language model pre-training by leveraging definitions of rare words in dictionaries.
We propose two novel self-supervised pre-training tasks on word and sentence-level alignment between input text sequence and rare word definitions.
arXiv Detail & Related papers (2021-10-13T04:29:14Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
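The unigram-versus-BPE comparison in the last entry can be reproduced at the segmentation level with SentencePiece, which implements both algorithms; the corpus path, vocabulary size, and model prefixes below are placeholder assumptions rather than the paper's exact setup.

```python
# Sketch: train a BPE and a unigram LM tokenizer on the same corpus and compare
# their segmentations (corpus.txt, vocab_size=8000, and prefixes are assumptions).
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",               # placeholder corpus file
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

sentence = "Byte pair encoding is suboptimal for language model pretraining."
print(bpe.encode(sentence, out_type=str))   # segmentation from learned merges
print(uni.encode(sentence, out_type=str))   # probabilistic unigram LM segmentation
```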