Fugu-MT Paper Translation (Abstract): VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

Paper abstract: VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

arxiv url: http://arxiv.org/abs/2604.12887v1
Date: Tue, 14 Apr 2026 15:37:30 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.538869
Title: VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
Title (reference translation): VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
Authors: Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir
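Abstract summary: VideoFlexTok represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner. The generative flow decoder enables realistic video reconstructions from any token count.
Reference score (independently computed attention score): 19.140563809250214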
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details "pixel-by-pixel" irrespective of the video's inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner -- where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.
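Abstract (reference translation): Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers determine what information is preserved and how it is organized. The de facto standard approach to video tokenization represents a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This forces the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict every low-level detail "pixel-by-pixel" regardless of the video's inherent complexity, which makes learning harder. VideoFlexTok instead represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner: the first tokens (emergently) capture abstract information such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows the token count to be adapted to downstream needs and allows videos longer than the baselines to be encoded with the same budget. VideoFlexTok is evaluated on class-to-video and text-to-video generative tasks, where it trains more efficiently than 3D grid tokens, e.g., reaching comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, the authors show that VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.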
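The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the 3D grid token count is derived from the stated "8x fewer" factor rather than reported in the abstract, and `truncate_prefix` is a hypothetical helper that assumes a shorter budget is obtained by keeping the leading tokens of the coarse-to-fine sequence, which the abstract suggests but does not spell out.

```python
# Figures taken directly from the abstract.
FLEX_TOKENS = 672        # tokens used for a 10-second, 81-frame video
REDUCTION_FACTOR = 8     # "8x fewer than a comparable 3D grid tokenizer"
FRAMES = 81
SMALL_MODEL_B = 1.1      # model size in billions of parameters
LARGE_MODEL_B = 5.2

# Derived (not stated in the abstract): implied token count of the 3D grid baseline.
grid_tokens = FLEX_TOKENS * REDUCTION_FACTOR

# Average tokens per frame for both representations.
flex_tokens_per_frame = FLEX_TOKENS / FRAMES
grid_tokens_per_frame = grid_tokens / FRAMES

# The abstract rounds this ratio to "5x".
model_size_ratio = LARGE_MODEL_B / SMALL_MODEL_B


def truncate_prefix(tokens, budget):
    """Hypothetical helper: keep the first `budget` tokens.

    Assumes the sequence is ordered coarse-to-fine, so the leading tokens
    carry the most abstract information.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return tokens[:budget]


if __name__ == "__main__":
    print(f"implied 3D grid tokens: {grid_tokens}")
    print(f"tokens per frame (flexible): {flex_tokens_per_frame:.1f}")
    print(f"tokens per frame (3D grid):  {grid_tokens_per_frame:.1f}")
    print(f"model size ratio: {model_size_ratio:.2f}x")
```

The exact parameter ratio is about 4.7, so the "5x smaller" claim is a rounded figure.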


Related papers