FuguReport

MARS: Enabling Autoregressive Models Multi-Token Generation

Authors: Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun
Affiliations: Nanyang Technological University / Singapore Management University / Uppsala University
Categories: Method / Fine-Tuning / Instruction-based tuning for AR models; Application / Language Modeling / Multi-token generation in AR models; Evaluation / Model Efficiency / Efficiency of multi-token generation methods
License: CC BY 4.0

Abstract Overview

MARS (Mask AutoRegression) is a lightweight fine-tuning method that enables instruction-tuned autoregressive (AR) language models to predict multiple tokens per forward pass while preserving their standard left-to-right decoding behavior. The method adds no extra parameters or architectural modifications and trains on existing supervised fine-tuning (SFT) data using a masked-prediction objective combined with an autoregressive loss. The authors identify four gaps between autoregressive and block-masked prediction, arguing that three are eliminable design choices (attention pattern, logits alignment, generation order) while only token masking is inherent. Experiments on Qwen2.5-0.5B and Qwen2.5-7B across six benchmarks show that the same checkpoint can operate in one-token mode with baseline-level or slightly improved quality, or in multi-token mode with higher throughput controlled by a confidence threshold. A block-level KV caching strategy is introduced to translate algorithmic tokens-per-forward gains into wall-clock speedups in batch inference.
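
The attention-pattern gap mentioned above can be illustrated with a small sketch. It assumes that a typical block-masked predictor lets positions inside one block see each other, which this summary does not state explicitly; the function names are hypothetical and are not taken from the paper.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i).

    This is the standard AR pattern, which the summary says MARS retains.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_size):
    """Assumed contrast case: causal across blocks, fully visible within a block."""
    return [[(j // block_size) <= (i // block_size) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Print both patterns for 4 positions and a block size of 2.
    for name, mask in [("causal", causal_mask(4)),
                       ("block-bidirectional", block_bidirectional_mask(4, 2))]:
        print(name)
        for row in mask:
            print("".join("x" if visible else "." for visible in row))
```

In the causal pattern, position 0 never sees position 1, so a model trained this way can still be decoded one token at a time without any change to its attention.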

Novelty

The main novelty is demonstrating that multi-token generation can be added to an autoregressive model through fine-tuning alone, without speculative draft models, extra decoding heads, or architectural changes, by closing three identified gaps between AR and block-masked prediction. The work also introduces an auxiliary SFT loss on the clean input stream as the key mechanism for preserving autoregressive competence at larger block sizes, and proposes a block-level KV caching scheme for practical batched inference.
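
One plausible way to write the combined objective is sketched below. The additive form, the weight \lambda, the masked index set \mathcal{M}, and the masked stream \tilde{x} are all assumptions made for illustration; this summary does not give the exact weighting or masking layout.

```latex
% Assumed notation: x is the clean response, \tilde{x} is the same sequence
% with the positions in \mathcal{M} replaced by a mask token.
\begin{aligned}
\mathcal{L}(\theta) &= \mathcal{L}_{\text{mask}}(\theta) + \lambda\,\mathcal{L}_{\text{SFT}}(\theta), \\
\mathcal{L}_{\text{mask}}(\theta) &= -\sum_{t \in \mathcal{M}} \log p_\theta\!\left(x_t \mid \tilde{x}_{<t}\right), \\
\mathcal{L}_{\text{SFT}}(\theta) &= -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right).
\end{aligned}
```

Under this reading, the second term is the ordinary next-token loss on the clean stream, which is the part the ablation identifies as protecting quality at larger block sizes.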

Results

In one-token mode, MARS matches or exceeds the AR SFT baseline on six benchmarks at both 0.5B scale (+1.7 points on average) and 7B scale (+1.5 points on average). In multi-token mode at confidence threshold τ=0.95, MARS-7B loses only 1.3 average points while generating 1.68 tokens per forward pass, and with block-level KV caching it achieves up to 1.71× wall-clock speedup over AR with KV cache on Qwen2.5-7B at batch size 4. Ablations confirm that removing the auxiliary SFT loss causes quality to deteriorate as block size increases (the 0.5B average drops from 28.4 to 22.2 as block size B grows from 4 to 16), while including it stabilizes performance across block sizes.
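
The reported figures can be turned into more intuitive quantities with a little arithmetic. The helper names below are hypothetical, and the two efficiency numbers come from different measurements (tokens per forward pass at τ=0.95 versus the best reported wall-clock speedup), so they are not expected to coincide.

```python
def forward_pass_reduction(tokens_per_forward):
    """Fraction of forward passes saved relative to one token per pass."""
    return 1.0 - 1.0 / tokens_per_forward


def relative_wall_clock(speedup):
    """Run time as a fraction of the baseline for a given speedup factor."""
    return 1.0 / speedup


# Ablation at 0.5B without the auxiliary SFT loss: block size 4 vs. 16.
ablation_drop = round(28.4 - 22.2, 1)

if __name__ == "__main__":
    print(f"forward passes saved at 1.68 tok/pass: {forward_pass_reduction(1.68):.1%}")
    print(f"run time at 1.71x speedup: {relative_wall_clock(1.71):.1%} of baseline")
    print(f"ablation drop: {ablation_drop} points")
```

Generating 1.68 tokens per pass means roughly 40% fewer forward passes for the same output length, and a 1.71× speedup means the run takes about 58% of the baseline time.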

Key Points

  1. MARS preserves strict autoregressive compatibility by retaining causal attention, right-shifted logits, and left-to-right token acceptance, closing three of four identified gaps between AR and block-masked prediction and leaving token masking as the only inherent difference.
  2. Combining the masked-prediction loss with an auxiliary clean-stream autoregressive SFT loss is empirically critical for maintaining quality at larger block sizes: without the auxiliary loss, average accuracy at 0.5B drops by 6.2 points as block size increases from 4 to 16, while with it the drop is only 0.7 points.
  3. Inference supports a controllable speed–quality tradeoff via confidence thresholding, and a block-level KV cache is necessary to convert algorithmic tokens-per-forward gains into wall-clock batch speedups, achieving up to 1.71× speedup at batch size 4 on Qwen2.5-7B.
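
The confidence-thresholded, left-to-right acceptance described in the key points can be sketched as follows. This is a toy illustration, not the authors' implementation: the `Proposer` interface, `toy_propose`, and the rule that the first proposal is always kept (so that decoding makes progress) are assumptions introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: given the tokens so far and a block size,
# return (token, confidence) proposals for the next `block_size` positions.
Proposer = Callable[[List[int], int], List[Tuple[int, float]]]


def accept_left_to_right(proposals: List[Tuple[int, float]], tau: float) -> List[int]:
    """Keep proposals in order and stop at the first one below tau.

    Assumption: the first proposal is always kept, like a normal AR step.
    """
    accepted = [proposals[0][0]]
    for token, confidence in proposals[1:]:
        if confidence < tau:
            break  # later positions are discarded even if they are confident
        accepted.append(token)
    return accepted


def generate(propose: Proposer, prompt: List[int], max_new_tokens: int,
             block_size: int, tau: float) -> Tuple[List[int], int]:
    """Decode until max_new_tokens are produced; return tokens and pass count."""
    tokens = list(prompt)
    forward_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        proposals = propose(tokens, block_size)
        forward_passes += 1
        remaining = max_new_tokens - (len(tokens) - len(prompt))
        tokens.extend(accept_left_to_right(proposals, tau)[:remaining])
    return tokens[len(prompt):], forward_passes


def toy_propose(tokens: List[int], block_size: int) -> List[Tuple[int, float]]:
    """Stand-in for a model: counts upward with fixed per-position confidences."""
    confidences = [0.99, 0.97, 0.50, 0.90]
    last = tokens[-1]
    return [(last + k + 1, confidences[k % len(confidences)])
            for k in range(block_size)]


if __name__ == "__main__":
    print(generate(toy_propose, [0], 6, 4, tau=0.95))
    print(generate(toy_propose, [0], 6, 4, tau=1.01))
```

In this sketch, a threshold above 1 accepts only the first proposal each pass and so behaves like one-token decoding, while lowering the threshold accepts longer prefixes and reduces the number of forward passes.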

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.