Fugu-MT Paper Translation (Summary): TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

Paper overview: TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

arxiv url: http://arxiv.org/abs/2602.00268v1
Date: Fri, 30 Jan 2026 19:44:16 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.087477
Title: TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation
Authors: Ariel Shaulov, Eitan Shaar, Amit Edenzon, Lior Wolf
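Abstract summary: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. Recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. The authors propose a simple inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning.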
Reference score (independently computed attention level): 45.36298679288268
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure, and without leaving latent space.
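The pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the deviation measure, the threshold, or how tokens are matched across batches. The sketch below assumes position-aligned tokens, uses cosine distance as a stand-in deviation score, and applies a fixed hypothetical threshold; the names `cosine_deviation`, `prune_unstable_tokens`, and `threshold` are illustrative only.

```python
import math


def cosine_deviation(a, b, eps=1e-8):
    """Cosine distance (1 - cosine similarity) between two token vectors.

    Assumption: cosine distance stands in for the paper's deviation measure,
    which the abstract does not specify.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, eps)


def prune_unstable_tokens(prev_tokens, curr_tokens, threshold=0.5):
    """Drop tokens of the current batch that deviate strongly from the previous batch.

    Assumptions (hypothetical, not from the paper): tokens are compared
    position by position, and `threshold` is a fixed cutoff.
    Returns the kept tokens and a boolean keep mask.
    """
    if len(prev_tokens) != len(curr_tokens):
        raise ValueError("token batches must have the same length")
    keep_mask = [
        cosine_deviation(p, c) <= threshold
        for p, c in zip(prev_tokens, curr_tokens)
    ]
    kept = [c for c, keep in zip(curr_tokens, keep_mask) if keep]
    return kept, keep_mask


# Toy example: three 2-dimensional latent tokens per batch.
prev_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_batch = [[0.9, 0.1], [1.0, 0.0], [1.0, 1.0]]

kept, keep_mask = prune_unstable_tokens(prev_batch, curr_batch, threshold=0.5)
print(keep_mask)  # the second token is orthogonal to its predecessor and is dropped
```

In this toy setup only the surviving tokens in `kept` would be passed on as conditioning context for the next batch; the model itself is untouched, which mirrors the abstract's claim that the method works purely at inference time.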

Related papers