Fugu-MT Paper Translation (Abstract): EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Paper overview: EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

arxiv url: http://arxiv.org/abs/2603.12267v1
Date: Thu, 12 Mar 2026 17:59:59 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.308312
Title: EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Title (reference translation): EVATok: Adaptive-Length Video Tokenization for Efficient Visual Autoregressive Generation
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu
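Abstract summary: EVATok is a framework for producing $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101.
Reference score (independently computed attention measure): 80.13014959623452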
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
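Abstract (reference translation): Autoregressive (AR) video generation models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is important for balancing reconstruction quality against the computational cost of downstream generation. Conventional video tokenizers apply a uniform token assignment to the temporal blocks of different videos, and often waste tokens on simple, static, or repetitive segments while giving too few to dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework for producing $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. The framework estimates the optimal token assignment for each video to achieve the best quality-cost trade-off, develops lightweight routers that predict these optimal assignments quickly, and trains adaptive tokenizers that encode videos according to the assignments the routers predict. We show that EVATok substantially improves efficiency and overall quality for video reconstruction and downstream AR generation. With an advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, while reducing average token usage by at least 24.4% compared with the prior state-of-the-art LARP and our fixed-length baseline.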
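
The quality-cost trade-off mentioned in the abstract can be illustrated with a small sketch. This is not the paper's method: the linear objective (quality minus a weighted token count), the names `Candidate`, `pick_assignment`, and `cost_weight`, and all numbers below are hypothetical and chosen only to show how an adaptive per-block assignment can beat uniform ones under such an objective.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One hypothetical token assignment for a video (tokens per temporal block)."""
    tokens_per_block: Tuple[int, ...]
    quality: float  # higher is better; stands in for any reconstruction metric


def pick_assignment(candidates: Sequence[Candidate], cost_weight: float) -> Candidate:
    """Return the candidate that maximizes quality minus a weighted token cost.

    The linear objective is an assumption made for illustration only.
    """
    def score(c: Candidate) -> float:
        return c.quality - cost_weight * sum(c.tokens_per_block)

    return max(candidates, key=score)


# Made-up numbers for a mostly static clip split into four temporal blocks.
candidates = [
    Candidate((256, 256, 256, 256), quality=0.95),  # uniform, large budget
    Candidate((128, 128, 128, 128), quality=0.90),  # uniform, small budget
    Candidate((256, 64, 64, 128), quality=0.94),    # adaptive: fewer tokens on static blocks
]

best = pick_assignment(candidates, cost_weight=1e-4)
print(best.tokens_per_block, sum(best.tokens_per_block))
```

With these illustrative values the adaptive candidate is selected: it matches the token total of the small uniform budget while keeping quality close to the large one. Setting `cost_weight` to 0 ignores cost and selects the highest-quality candidate instead.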

Related papers