TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation
- URL: http://arxiv.org/abs/2602.00268v1
- Date: Fri, 30 Jan 2026 19:44:16 GMT
- Title: TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation
- Authors: Ariel Shaulov, Eitan Shaar, Amit Edenzon, Lior Wolf
- Abstract summary: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. Recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning.
- Score: 45.36298679288268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure, and without leaving latent space.
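The abstract's core idea can be sketched concretely. A minimal, hypothetical reading (not the paper's actual implementation): score each latent token by how far its representation drifts from the spatially aligned token in the previously generated batch, and drop tokens whose similarity falls below a threshold before they re-enter the conditioning context. The function name, the cosine-similarity criterion, and the threshold value are all assumptions for illustration.

```python
import numpy as np

def prune_unstable_tokens(prev_tokens, curr_tokens, threshold=0.8):
    """Keep only tokens whose representation stays close to the
    previous batch; drop the rest before they are reused as context.

    prev_tokens, curr_tokens: (num_tokens, dim) latent arrays for
    spatially aligned token positions in consecutive batches.
    Returns the retained tokens and the boolean keep-mask.
    """
    # Cosine similarity between aligned token positions.
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr_n = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    similarity = np.sum(prev_n * curr_n, axis=1)

    # "Unstable" tokens are those that deviate strongly from the
    # previous batch; they are excluded from the conditioning context.
    stable_mask = similarity >= threshold
    return curr_tokens[stable_mask], stable_mask
```

In an auto-regressive loop, only the retained tokens would be fed back as conditioning for the next batch, so a corrupted token cannot propagate its error forward.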
Related papers
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference [58.189320101488725]
DLLMs promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix, a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state.
arXiv Detail & Related papers (2026-02-26T11:08:11Z) - LoL: Longer than Longer, Scaling Video Generation to Hour [50.945885467651216]
This work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
arXiv Detail & Related papers (2026-01-23T17:21:35Z) - Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics [0.7252027234425333]
We introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that "mature" over multiple update steps before being discretized. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax). Additional perturbations, such as dynamics or history smoothing, can be incorporated naturally but are not required for the model to function.
arXiv Detail & Related papers (2026-01-08T11:44:34Z) - Stable Video Infinity: Infinite-Length Video Generation with Error Recycling [76.91310169118408]
We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer's self-generated errors into supervisory prompts. We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
arXiv Detail & Related papers (2025-10-10T09:45:46Z) - Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis [79.98107530577576]
DisCon is a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.
arXiv Detail & Related papers (2025-07-02T14:33:52Z) - Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [85.82112629564942]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
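The dimension-wise quantization idea described above can be sketched as follows: instead of assigning one codebook index to a whole token vector, each scalar feature dimension is independently snapped to the nearest level of a small shared 1-D codebook. This is an illustrative sketch, not TokenBridge's actual implementation; the function name, the uniform codebook, and the value range are assumptions.

```python
import numpy as np

def dimensionwise_quantize(tokens, num_levels=16, lo=-1.0, hi=1.0):
    """Quantize each feature dimension independently to a small
    uniform 1-D codebook.

    tokens: (..., dim) continuous latent array.
    Returns discrete per-dimension indices and the dequantized
    continuous reconstruction.
    """
    levels = np.linspace(lo, hi, num_levels)  # shared 1-D codebook
    clipped = np.clip(tokens, lo, hi)
    # Nearest codebook level for every scalar entry (dimension-wise).
    indices = np.abs(clipped[..., None] - levels).argmin(axis=-1)
    return indices, levels[indices]
```

Because every dimension only needs `num_levels` categories, the autoregressive model can use standard categorical prediction per dimension rather than a single huge vector-quantization codebook.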
arXiv Detail & Related papers (2025-03-20T17:59:59Z) - Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation [0.0]
Continuous Autoregressive Models can suffer from a decline in generation quality over extended sequences due to error accumulation during inference. We introduce a novel method to address this issue by injecting random noise into the input embeddings during training. This work paves the way for generating continuous embeddings in a purely autoregressive setting, opening new possibilities for real-time and interactive generative applications.
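The noise-injection idea in this abstract amounts to perturbing the teacher-forced input embeddings during training so the model learns to tolerate the imperfect embeddings it will condition on at inference time. A minimal sketch under that reading (the function name and the noise scale are assumptions, not the paper's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_teacher_forcing(embeddings, noise_std=0.1):
    """Add Gaussian noise to ground-truth input embeddings during
    training, simulating the accumulated error the model will see
    in its own outputs during autoregressive inference."""
    noise = noise_std * rng.standard_normal(embeddings.shape)
    return embeddings + noise
```

At inference time no noise is added; the robustness learned from the perturbed inputs is what keeps errors from compounding over long sequences.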
arXiv Detail & Related papers (2024-11-27T15:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.