Fugu-MT Paper Translation (Abstract): Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

Paper overview: Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

arxiv url: http://arxiv.org/abs/2606.06158v1
Date: Thu, 04 Jun 2026 13:31:12 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.819695
Title: Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting
Authors: Kevin Dave, Sai Aditya Patkuri, Chhaya Kumar Das, Gouranga Bala, R. Venkatesh Babu, Rajeshkumar SA
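Abstract summary: Adaptive video tokenisation dynamically allocates token budgets based on the underlying visual complexity of a sequence. The authors show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly. They introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences.
Reference score (independently computed attention score): 18.71433885255431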
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes are compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers \cite{infotok, agarwal2025cosmos}, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a $31\times$ inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an $\approx2\times$ speedup over the discrete information-theoretic baseline (InfoTok).
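
The thresholding step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the (T, H, W, C) latent layout, summing absolute differences over channels, always keeping the first frame, and comparing each frame with its immediate predecessor are all assumptions made for illustration; the abstract states only that a fixed threshold is applied to per-position temporal-L1 differences.

```python
import numpy as np


def temporal_redundancy_mask(latents: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean keep-mask of shape (T, H, W) for latents of shape (T, H, W, C).

    Hypothetical helper: a position is kept when the L1 difference to the same
    position in the previous frame exceeds `threshold`.
    """
    # Per-position temporal L1 difference, summed over channels (assumption).
    diff = np.abs(latents[1:] - latents[:-1]).sum(axis=-1)
    # Assumption: the first frame has no predecessor, so all its positions are kept.
    keep = np.ones(latents.shape[:-1], dtype=bool)
    keep[1:] = diff > threshold
    return keep


def kept_fraction(keep: np.ndarray) -> float:
    """Fraction of latent positions retained; the compression rate follows from it."""
    return float(keep.mean())


rng = np.random.default_rng(0)
# A static clip: the same latent frame repeated six times.
static_clip = np.tile(rng.normal(size=(1, 4, 4, 8)), (6, 1, 1, 1))
# A dynamic clip: independent random latents in every frame.
dynamic_clip = rng.normal(size=(6, 4, 4, 8))

static_keep = temporal_redundancy_mask(static_clip, threshold=0.5)
dynamic_keep = temporal_redundancy_mask(dynamic_clip, threshold=0.5)
```

With the same fixed threshold, the static clip retains only its first frame while the dynamic clip retains far more positions, which mirrors the content-driven allocation the abstract describes. The dropped positions would then be reconstructed by a model such as LIT, which this sketch does not cover.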


Related papers