Fugu-MT Paper Translation (Abstract): Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

Paper overview: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning

arxiv url: http://arxiv.org/abs/2505.07538v1
Date: Mon, 12 May 2025 13:19:08 GMT
Status: translation complete
System update date: 2025-05-13 20:21:49.393812
Title: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Title (reference translation): Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Authors: Bohan Wang, Zhongqi Yue, Fengda Zhang, Shuo Chen, Li'an Bi, Junzhe Zhang, Xue Song, Kennard Yanting Chan, Jiachun Pan, Weijia Wu, Mingze Zhou, Wang Lin, Kaihang Pan, Saining Zhang, Liyu Jia, Wentao Hu, Wei Zhao, Hanwang Zhang,
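Abstract summary: We introduce the Self-consistency Tokenizer (Selftok). At its design core, it composes an autoregressive (AR) prior into visual tokens by using the reverse diffusion process of image generation.
Reference score (attention score computed by this site): 62.39335643853649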
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We completely discard the conventional spatial prior in image representation and introduce a novel discrete visual tokenizer: Self-consistency Tokenizer (Selftok). At its design core, we compose an autoregressive (AR) prior -- mirroring the causal structure of language -- into visual tokens by using the reverse diffusion process of image generation. The AR property makes Selftok fundamentally distinct from traditional spatial tokens in the following two key ways: - Selftok offers an elegant and minimalist approach to unify diffusion and AR for vision-language models (VLMs): By representing images with Selftok tokens, we can train a VLM using a purely discrete autoregressive architecture -- like that in LLMs -- without requiring additional modules or training objectives. - We theoretically show that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Therefore, Selftok supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Besides the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation tasks. Impressively, without using any text-image training pairs, a simple policy gradient RL working in the visual tokens can significantly boost the visual generation benchmark, surpassing all the existing models by a large margin. Therefore, we believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. When combined with the well-established strengths of RL in LLMs, this brings us one step closer to realizing a truly multimodal LLM. Project Page: https://selftok-team.github.io/report/.
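Abstract (reference translation): We completely discard the conventional spatial prior in image representation and introduce a new discrete visual tokenizer, the Self-consistency Tokenizer (Selftok). At its design core, we use the reverse diffusion process of image generation to compose an autoregressive (AR) prior, which mirrors the causal structure of language, into visual tokens. Selftok offers an elegant and minimalist approach to unifying diffusion and AR for vision-language models (VLMs): by representing images with Selftok tokens, a VLM can be trained with a purely discrete autoregressive architecture like that of LLMs. We show theoretically that the AR prior satisfies the Bellman equation, whereas the spatial prior does not. Selftok therefore supports reinforcement learning (RL) for visual generation with effectiveness comparable to that achieved in LLMs. Beyond the AR property, Selftok is also a SoTA tokenizer that achieves a favorable trade-off between high-quality reconstruction and compression rate. We use Selftok to build a pure AR VLM for both visual comprehension and generation. Notably, without using any text-image training pairs, a simple policy gradient RL operating on the visual tokens significantly boosts the visual generation benchmark, surpassing all existing models by a large margin. We therefore believe that Selftok effectively addresses the long-standing challenge that visual tokens cannot support effective RL. Combined with the well-established strengths of RL in LLMs, this brings us one step closer to a truly multimodal LLM. Project Page: https://selftok-team.github.io/report/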
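
The abstract's claim that a simple policy gradient RL can operate directly on autoregressive visual tokens can be illustrated with a toy sketch. The code below is not Selftok and not the authors' training setup. It assumes a hypothetical 4-token vocabulary, a bigram policy (each token conditions only on the previous one), and a made-up reward that favors one token. It only shows the general shape of REINFORCE with a moving-average baseline over a left-to-right token sequence.

```python
import math
import random

VOCAB = 4      # toy vocabulary size (assumption, not from the paper)
LENGTH = 3     # tokens per toy "image" (assumption)
TARGET = 2     # token that the hypothetical reward favors
BOS = VOCAB    # row of the logit table used before the first token


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(theta, rng):
    """Sample tokens left to right; each step conditions on the previous token."""
    prev, tokens, steps = BOS, [], []
    for _ in range(LENGTH):
        probs = softmax(theta[prev])
        tok = rng.choices(range(VOCAB), weights=probs)[0]
        steps.append((prev, tok, probs))
        tokens.append(tok)
        prev = tok
    return tokens, steps


def reward(tokens):
    """Hypothetical reward: fraction of tokens equal to TARGET."""
    return sum(t == TARGET for t in tokens) / LENGTH


def train(iterations=3000, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline on the toy bigram policy."""
    rng = random.Random(seed)
    theta = [[0.0] * VOCAB for _ in range(VOCAB + 1)]  # logits per previous token
    baseline = 0.0
    for _ in range(iterations):
        tokens, steps = sample_sequence(theta, rng)
        r = reward(tokens)
        advantage = r - baseline
        baseline += 0.05 * (r - baseline)
        for prev, tok, probs in steps:
            for k in range(VOCAB):
                # Gradient of log softmax w.r.t. logit k: one-hot(tok) - probs.
                grad = (1.0 if k == tok else 0.0) - probs[k]
                theta[prev][k] += lr * advantage * grad
    return theta


if __name__ == "__main__":
    theta = train()
    print("P(first token = TARGET):", softmax(theta[BOS])[TARGET])
```

Because the sequence is generated one token at a time, each step is an action in a sequential decision process and the sequence-level reward can be credited to every step, which is the structure the abstract associates with the AR prior.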

Related papers