Fugu-MT Paper Translation (Abstract): Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Paper overview: Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

arxiv url: http://arxiv.org/abs/2506.18898v1
Date: Mon, 23 Jun 2025 17:59:14 GMT
Status: Translation complete
Last updated in system: 2025-06-24 19:06:37.119888
Title: Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
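Abstract summary: This paper proposes a framework that unifies visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency.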
Reference score (site's own attention metric): 33.11867433769496
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com
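The tokenization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the array shapes, the random stand-in data, the helper names `build_text_aligned_codebook` and `quantize`, the nearest-neighbour (L2) assignment rule, and the offset used to place visual tokens after the text vocabulary are all assumptions made for illustration.

```python
import numpy as np


def build_text_aligned_codebook(llm_embeddings, projection):
    """Project LLM token embeddings (V, d_llm) into the visual feature space (V, d_vis).

    Hypothetical helper: in a trained system the projection would be learned.
    """
    return llm_embeddings @ projection


def quantize(features, codebook):
    """Return, for each feature vector (N, d), the index of its nearest codebook row (L2)."""
    # Squared distance expanded as ||f||^2 - 2 f.c + ||c||^2, giving an (N, V) matrix.
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)


rng = np.random.default_rng(0)

# Stand-in data (assumed sizes, not taken from the paper).
text_vocab_size = 1000
llm_embeddings = rng.normal(size=(text_vocab_size, 64))  # (vocab, d_llm)
projection = rng.normal(size=(64, 32))                   # (d_llm, d_vis)
codebook = build_text_aligned_codebook(llm_embeddings, projection)

# A 4x4 grid of image patch features, flattened to 16 vectors.
patch_features = rng.normal(size=(16, 32))
token_ids = quantize(patch_features, codebook)

# Assumed convention: visual tokens occupy ids after the text vocabulary,
# so text and image tokens share one expanded vocabulary.
visual_token_ids = token_ids + text_vocab_size
```

Because every codebook row is derived from an LLM token embedding, each discrete image token in this sketch corresponds to a position in the language model's vocabulary space, which is the sense in which the codebook is "text-aligned".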
