MARS: Enabling Autoregressive Models Multi-Token Generation
Abstract Overview
MARS (Mask AutoRegression) is a lightweight fine-tuning method that enables instruction-tuned autoregressive language models to predict multiple tokens per forward pass while preserving their standard left-to-right decoding behavior. The method adds no extra parameters or architectural modifications and trains on existing supervised fine-tuning data using a masked-prediction objective combined with an autoregressive loss. The authors identify four gaps between autoregressive and block-masked prediction, arguing that three are eliminable design choices (attention pattern, logits alignment, generation order) while only token masking is inherent. Experiments on Qwen2.5-0.5B and 7B models across six benchmarks demonstrate that the same checkpoint can operate in one-token mode with baseline-level or slightly improved quality, or in multi-token mode with higher throughput controlled by a confidence threshold. A block-level KV caching strategy is introduced to translate algorithmic token-per-forward gains into wall-clock speedups in batch inference.
Novelty
The main novelty is demonstrating that multi-token generation can be added to an autoregressive model through fine-tuning alone (without speculative draft models, extra decoding heads, or architectural changes) by closing three identified gaps between AR and block-masked prediction. The work also introduces an auxiliary SFT loss on the clean input stream as the key mechanism for preserving autoregressive competence at larger block sizes, and proposes a block-level KV caching scheme for practical batched inference.
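The way the two training terms combine can be illustrated with a minimal sketch. It assumes both terms are mean negative log-likelihoods added together with a weight; the function names and the `aux_weight` parameter are hypothetical, and the summary does not state the actual weighting or reduction used by the authors.

```python
import math
from typing import Sequence


def mean_nll(target_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood, given the probability the model
    assigned to each target token."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def combined_loss(
    masked_target_probs: Sequence[float],
    clean_target_probs: Sequence[float],
    aux_weight: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Masked-prediction loss plus an auxiliary AR loss on the clean stream.

    masked_target_probs: probabilities of the true tokens at masked positions.
    clean_target_probs: next-token probabilities on the unmasked (clean) input.
    """
    masked_term = mean_nll(masked_target_probs)
    clean_term = mean_nll(clean_target_probs)
    return masked_term + aux_weight * clean_term
```

Setting `aux_weight=0.0` in this sketch corresponds to the ablation that removes the auxiliary SFT loss.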
Results
In one-token mode, MARS matches or exceeds the AR SFT baseline on six benchmarks at both 0.5B scale (+1.7 points on average) and 7B scale (+1.5 points on average). In multi-token mode at τ=0.95, MARS-7B loses only 1.3 points on average while generating 1.68 tokens per forward pass, and with block-level KV caching it achieves up to 1.71× wall-clock speedup over AR with KV cache on Qwen2.5-7B at batch size 4. Ablations confirm that removing the auxiliary SFT loss causes quality to deteriorate as block size increases (the 0.5B average drops from 28.4 to 22.2 as block size B grows from 4 to 16), while including it stabilizes performance across block sizes.
Key Points
- MARS preserves strict autoregressive compatibility by retaining causal attention, right-shifted logits, and left-to-right token acceptance, closing three of four identified gaps between AR and block-masked prediction and leaving token masking as the only inherent difference.
- Combining the masked-prediction objective with an auxiliary autoregressive SFT loss on the clean stream is empirically critical for maintaining quality at larger block sizes: without the auxiliary loss, average accuracy at 0.5B drops by 6.2 points (from 28.4 to 22.2) as block size increases from 4 to 16, while with it the drop is only 0.7 points.
- Inference supports a controllable speed–quality tradeoff via confidence thresholding, and a block-level KV cache is necessary to convert algorithmic token-per-forward gains into wall-clock batch speedups, achieving up to 1.71× speedup at batch size 4 on Qwen2.5-7B.
References
- arXiv: https://arxiv.org/abs/2604.07023v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.07023v1
- Hugging Face Papers: https://huggingface.co/papers/2604.07023
- GitHub: https://github.com/Xalp/MARS