Fugu-MT Paper Translation (Abstract): CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

Paper overview: CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

arxiv url: http://arxiv.org/abs/2603.06449v1
Date: Fri, 06 Mar 2026 16:39:17 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:46.209881
Title: CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
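Title (reference translation): CaTok: Modeling Mean Flows for One-Dimensional Causal Image Tokenization
Authors: Yitong Chen, Zuxuan Wu, Xipeng Qiu, Yu-Gang Jiang
Abstract summary: This paper introduces CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling. Experiments show that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM.
Reference score (independently computed attention score): 122.88484422855934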
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization, REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.
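Abstract (reference translation): Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision is not straightforward. Current visual tokenizers either flatten 2D patches into non-causal sequences or impose heuristic orderings that do not match the "next-token prediction" pattern. Recent diffusion autoencoders fall short in the same way: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, the paper introduces CaTok, a 1D causal image tokenizer with a MeanFlow decoder. As shown in Fig. 1, by selecting tokens over time intervals and binding them to the MeanFlow objective, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, and it naturally captures diverse visual concepts across token intervals. To further stabilize and accelerate training, the paper proposes a simple regularization, REPA-A, that aligns encoder features with Vision Foundation Models (VFMs). In experiments, CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR, and 0.674 SSIM with fewer training epochs.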
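The abstract's contrast between one-step generation and multi-step sampling can be illustrated with a toy sketch. It assumes the general MeanFlow-style update, in which a model predicts the average velocity u(z_t, r, t) over an interval [r, t] and a sample jumps from time t to time r via z_r = z_t - (t - r) * u(z_t, r, t). This is not CaTok's decoder: the names `make_toy_average_velocity` and `sample` are hypothetical, the closed-form velocity field stands in for a trained network, and the token conditioning described in the abstract is omitted entirely.

```python
from typing import Callable, List

Vector = List[float]
AverageVelocity = Callable[[Vector, float, float], Vector]


def make_toy_average_velocity(target: Vector) -> AverageVelocity:
    """Closed-form average velocity for the straight path z_t = (1 - t) * x + t * eps.

    On this path the velocity (eps - x) is constant, so its average over any
    interval [r, t] equals (z_t - x) / t. A trained network would replace this
    function; it is an assumption made purely for illustration.
    """
    def u(z_t: Vector, r: float, t: float) -> Vector:
        return [(z - x) / t for z, x in zip(z_t, target)]

    return u


def sample(u: AverageVelocity, noise: Vector, num_steps: int) -> Vector:
    """Move from t = 1 (noise) to t = 0 (data) in num_steps equal jumps.

    Each jump applies z_r = z_t - (t - r) * u(z_t, r, t).
    num_steps = 1 is the one-step case; larger values give multi-step sampling.
    """
    z = list(noise)
    for i in range(num_steps):
        t = 1.0 - i / num_steps        # start of the current interval
        r = 1.0 - (i + 1) / num_steps  # end of the current interval
        velocity = u(z, r, t)
        z = [zi - (t - r) * vi for zi, vi in zip(z, velocity)]
    return z


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    noise = [1.5, 0.25, -0.75]
    u = make_toy_average_velocity(target)
    print("one step:  ", sample(u, noise, num_steps=1))
    print("four steps:", sample(u, noise, num_steps=4))
```

Because the toy velocity field is exact, both calls land on the target up to floating-point error. With a learned average velocity the one-step result is only approximate, which is where extra steps can improve fidelity.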


Related papers