Fugu-MT Paper Translation (Abstract): Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Paper overview: Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

arxiv url: http://arxiv.org/abs/2606.18249v1
Date: Tue, 16 Jun 2026 17:59:22 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.600745
Title: Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
Title (reference translation): Unified Multimodal Autoregressive Modeling with a Shared Context-Visual Tokenizer
Authors: Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai
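Abstract summary: UniAR is a unified autoregressive framework in which a single visual tokenizer serves as the key bridge between understanding and generation. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme. A diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images.
Reference score (independently computed attention score): 80.62512020268626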
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
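Abstract (reference translation): Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two different visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework in which a single discrete visual tokenizer serves as the key bridge between understanding and generation, so that the model can directly interpret its own generated visual tokens without re-encoding them. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel bitwise prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.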
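
The abstract says a lookup-free bitwise quantization scheme scales the effective visual vocabulary at minimal cost. The sketch below shows one generic way such a scheme can work: each latent channel is reduced to one bit by its sign, so the token index is implied by the bits and no codebook table is stored or searched. This is an illustrative assumption, not UniAR's actual tokenizer; the function names, the sign rule, and the 16-bit width are all hypothetical.

```python
from typing import List, Sequence


def quantize_bits(features: Sequence[float]) -> List[int]:
    """Map each latent channel to one bit by its sign (no codebook lookup)."""
    return [1 if x > 0 else 0 for x in features]


def bits_to_index(bits: Sequence[int]) -> int:
    """Pack the bits into the implied token index (bit 0 is least significant)."""
    return sum(b << k for k, b in enumerate(bits))


def index_to_code(index: int, num_bits: int) -> List[float]:
    """Recover the quantized vector in {-1, +1}^num_bits from a token index."""
    return [1.0 if (index >> k) & 1 else -1.0 for k in range(num_bits)]


# Hypothetical bit width; the abstract does not state the value used in the paper.
NUM_BITS = 16

# A made-up latent vector for one visual position.
latent = [0.3, -1.2, 0.8, -0.1] + [0.5] * 12

bits = quantize_bits(latent)
token = bits_to_index(bits)
code = index_to_code(token, NUM_BITS)

# Each extra bit doubles the implied vocabulary without adding any table entries.
vocab_size = 2 ** NUM_BITS
```

Under these assumptions, growing the vocabulary only means adding channels, which is one reading of "minimal cost". A model that predicts the bits of a position in parallel would emit `NUM_BITS` binary outputs instead of one softmax over `vocab_size` classes.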
