ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
- URL: http://arxiv.org/abs/2603.03583v1
- Date: Tue, 03 Mar 2026 23:20:31 GMT
- Title: ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
- Authors: Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li
- Abstract summary: We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations. Experiments demonstrate that this chunking strategy yields substantial performance gains.
- Score: 17.871012556931067
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
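The abstract's idea of choosing boundaries by a score while keeping a fixed number of chunks can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring function, the rule that position 0 always starts a chunk, and the names `coding_rate_score`, `select_chunk_starts`, and `segment` are all assumptions made for illustration. The score uses the standard rank-one form of the coding rate, 0.5 * log(1 + (d / eps^2) * ||z||^2); the abstract does not say how ByteFlow Net computes or applies its coding rate.

```python
import math


def coding_rate_score(z, eps=0.5):
    """Coding rate of one latent vector z, used here as a boundary score.

    For a single d-dimensional vector, 0.5 * logdet(I + (d / eps**2) * z z^T)
    reduces to 0.5 * log(1 + (d / eps**2) * ||z||^2) by the matrix
    determinant lemma, so no linear algebra library is needed.
    Assumption: this per-position score is illustrative only.
    """
    d = len(z)
    sq_norm = sum(x * x for x in z)
    return 0.5 * math.log(1.0 + (d / eps ** 2) * sq_norm)


def select_chunk_starts(scores, k):
    """Pick exactly k chunk starts: position 0 plus the k - 1
    highest-scoring remaining positions, returned in sequence order."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(scores)")
    # Ties are broken in favour of the earlier position.
    ranked = sorted(range(1, n), key=lambda i: (-scores[i], i))
    return [0] + sorted(ranked[:k - 1])


def segment(data, starts):
    """Split data into chunks that begin at the given start positions."""
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


data = b"hello world"
# Hypothetical per-byte scores; a real model would derive them from latents.
scores = [0.1, 0.2, 0.1, 0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1]
starts = select_chunk_starts(scores, k=3)
chunks = segment(data, starts)
print(starts, chunks)
```

Because `k` is fixed ahead of time, every input yields the same number of chunks, which is one way to read "a static computation graph": the boundaries move with the input, but the tensor shapes downstream do not.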
Related papers
- Proxy Compression for Language Modeling [58.904023114033954]
Proxy compression is an alternative training scheme that preserves the efficiency benefits of compressed inputs. Experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency. As model scale increases, proxy-trained models eventually match or rival tokenizer approaches.
arXiv Detail & Related papers (2026-02-04T07:36:46Z)
- Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
- The End of Manual Decoding: Towards Truly End-to-End Language Models [45.96704867353608]
This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values. We demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline.
arXiv Detail & Related papers (2025-10-30T17:01:43Z)
- FLEXITOKENS: Flexible Tokenization for Evolving Language Models [9.003053181721823]
Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. We develop byte-level LMs with learnable tokenizers to make tokenization adaptive.
arXiv Detail & Related papers (2025-07-17T01:55:41Z)
- Dynamic Chunking for End-to-End Hierarchical Sequence Modeling [17.277753030570263]
We introduce techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision.
arXiv Detail & Related papers (2025-07-10T17:39:37Z)
- Instruction-Following Pruning for Large Language Models [58.329978053711024]
We move beyond the traditional static pruning approach of determining a fixed pruning mask for a model. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task.
arXiv Detail & Related papers (2025-01-03T20:19:14Z)
- Byte Latent Transformer: Patches Scale Better Than Tokens [101.10994909832063]
Byte Latent Transformer (BLT) encodes bytes into dynamically sized patches, which serve as the primary units of computation. For fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
arXiv Detail & Related papers (2024-12-13T05:33:32Z)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs). This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts. We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model [0.0]
Excessive overhead leads to large latency and computational costs.
We propose a model acceleration approach for large language models.
Our model achieves an 18x FLOPs speedup with an accuracy degradation of less than 8% compared to BERT.
arXiv Detail & Related papers (2023-05-21T13:30:56Z)
- SlimSeg: Slimmable Semantic Segmentation with Boundary Supervision [54.16430358203348]
We propose a simple but effective slimmable semantic segmentation (SlimSeg) method, which can be executed at different capacities during inference.
We show that our proposed SlimSeg with various mainstream networks can produce flexible models that provide dynamic adjustment of computational cost and better performance.
arXiv Detail & Related papers (2022-07-13T14:41:05Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Towards Efficient Scene Understanding via Squeeze Reasoning [71.1139549949694]
We propose a novel framework called Squeeze Reasoning.
Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector.
We show that our approach can be modularized as an end-to-end trained block and can be easily plugged into existing networks.
arXiv Detail & Related papers (2020-11-06T12:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.