Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction
- URL: http://arxiv.org/abs/2512.05597v1
- Date: Fri, 05 Dec 2025 10:35:43 GMT
- Title: Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction
- Authors: Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers,
- Abstract summary: We introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. We show that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only $\sim7.5\%$ additional parameters.
- Score: 31.512139444227405
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on the ASE and Structured3D benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only $\sim7.5\%$ additional parameters.
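The abstract does not spell out how confidence-guided decoding scores or filters tokens, so the following is only a minimal sketch of the general idea: several tokens are proposed per decoder step, and only the leading run of sufficiently confident ones is kept. The names `accept_confident_prefix`, `decode`, and `make_toy_proposer`, the fixed threshold, and the rule that the first proposed token is always accepted are all assumptions made for illustration, not the paper's method.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical proposal format: (candidate tokens, per-token confidence).
Proposal = Tuple[List[int], List[float]]


def accept_confident_prefix(
    tokens: Sequence[int], confidences: Sequence[float], threshold: float
) -> List[int]:
    """Keep the leading tokens up to the first one whose confidence is below threshold.

    Assumption: the first token plays the role of the ordinary next-token
    prediction and is always accepted, so every step makes progress.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    if not tokens:
        return []
    accepted = [tokens[0]]
    for token, confidence in zip(tokens[1:], confidences[1:]):
        if confidence < threshold:
            break  # later tokens depend on this one, so stop here
        accepted.append(token)
    return accepted


def decode(
    propose: Callable[[List[int]], Proposal],
    eos: int,
    threshold: float,
    max_steps: int = 100,
) -> Tuple[List[int], int]:
    """Run the decoding loop and return (generated tokens, decoder steps used)."""
    output: List[int] = []
    steps = 0
    while steps < max_steps:
        tokens, confidences = propose(output)
        steps += 1
        for token in accept_confident_prefix(tokens, confidences, threshold):
            output.append(token)
            if token == eos:
                return output, steps
    return output, steps


def make_toy_proposer(
    target: Sequence[int], confidences: Sequence[float] = (1.0, 0.95, 0.5)
) -> Callable[[List[int]], Proposal]:
    """Toy stand-in for a multi-token decoder: proposes the next few target tokens."""
    def propose(prefix: List[int]) -> Proposal:
        chunk = list(target[len(prefix):len(prefix) + len(confidences)])
        return chunk, list(confidences[:len(chunk)])
    return propose


if __name__ == "__main__":
    proposer = make_toy_proposer([5, 6, 7, 8, 9, 0])
    print(decode(proposer, eos=0, threshold=0.9))   # ([5, 6, 7, 8, 9, 0], 3)
    print(decode(proposer, eos=0, threshold=0.99))  # ([5, 6, 7, 8, 9, 0], 6)
```

In this toy setup, lowering the threshold accepts more tokens per step and so uses fewer decoder steps, while a stricter threshold falls back toward one token per step. A real system would additionally have to verify or score tokens with the model itself.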
Related papers
- Multi-Token Prediction via Self-Distillation [73.81494481537636]
We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $5\%$ drop in accuracy relative to single token decoding performance.
arXiv Detail & Related papers (2026-02-05T18:54:48Z)
- Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z)
- Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass. HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z)
- SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing [13.521180435948791]
We propose a novel end-to-end framework for GUI perception. Instead of using probability-based discrete modeling, we perform continuous modeling of coordinates. This effectively mitigates the limitations inherent in the discrete output characteristics.
arXiv Detail & Related papers (2025-09-05T08:24:12Z)
- Set Block Decoding is a Language Model Inference Accelerator [48.061016901663386]
We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. We demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.
arXiv Detail & Related papers (2025-09-04T13:02:39Z)
- Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction [12.92060812931049]
Minor changes in prompt can cause significant discrepancies in model performance. We propose Placeholding Parallel Prediction (P3), a novel approach that predicts token probabilities across multiple positions. Experiments show improved accuracy and up to 98% reduction in the standard deviation across prompts.
arXiv Detail & Related papers (2025-04-04T04:39:51Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based Decoding [10.002379593718471]
A successful parse transforms an input utterance to an action that is easily understood by the system.
For complex parsing tasks, the state-of-the-art method is based on autoregressive sequence to sequence models to generate the parse directly.
arXiv Detail & Related papers (2020-10-08T01:18:42Z)