Fugu-MT Paper Translation (Abstract): Emu3.5: Native Multimodal Models are World Learners

Paper abstract: Emu3.5: Native Multimodal Models are World Learners

arxiv url: http://arxiv.org/abs/2510.26583v1
Date: Thu, 30 Oct 2025 15:11:16 GMT
Status: Translation complete
Last updated in system: 2025-10-31 16:05:09.876625
Title: Emu3.5: Native Multimodal Models are World Learners
Title (reference translation): Emu3.5: Native Multimodal Models Are World Learners
Authors: Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
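Abstract summary: Emu3.5 is a large-scale multimodal world model that natively predicts the next state across vision and language. It is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data. It exhibits generalizable world-modeling abilities that enable consistent world exploration and open-world embodied manipulation.
Reference score (independently computed attention score): 65.85558430499516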
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
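Abstract (reference translation): Emu3.5 is a large-scale multimodal world model that natively predicts the next state across vision and language. It is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, derived primarily from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to strengthen multimodal reasoning and generation. To improve inference efficiency, the authors propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction and speeds up per-image inference by about 20x without sacrificing performance. Emu3.5 shows strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also shows generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 performs on par with Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and achieves superior results on a suite of interleaved generation tasks. Emu3.5 is open-sourced at https://github.com/baaivision/Emu3.5 to support community research.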
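
Illustrative note: to make the "unified next-token prediction objective" mentioned in the abstract more concrete, the following is a minimal Python sketch. It is not the Emu3.5 implementation. The toy vocabulary sizes, the token ids, the `interleave` and `next_token_loss` helpers, the uniform stand-in model, and the assumption that images are represented as discrete token ids are all hypothetical and chosen only for illustration. The sketch shows one idea only: text and image tokens can share a single sequence and a single cross-entropy loss.

```python
import math

# Hypothetical toy vocabulary (an assumption, not taken from the paper):
# ids 0-3 are text tokens, ids 4-7 are image tokens.
TEXT_VOCAB = 4
IMAGE_VOCAB = 4
VOCAB_SIZE = TEXT_VOCAB + IMAGE_VOCAB


def interleave(segments):
    """Flatten (modality, token_ids) segments into one shared-vocabulary sequence."""
    sequence = []
    for modality, token_ids in segments:
        # Image token ids are shifted past the text ids so both share one id space.
        offset = 0 if modality == "text" else TEXT_VOCAB
        sequence.extend(offset + t for t in token_ids)
    return sequence


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(sequence, logits_fn):
    """Mean cross-entropy of predicting token t+1 from the prefix ending at t."""
    losses = []
    for t in range(len(sequence) - 1):
        probs = softmax(logits_fn(sequence[: t + 1]))
        losses.append(-math.log(probs[sequence[t + 1]]))
    return sum(losses) / len(losses)


def uniform_logits(prefix):
    """Stand-in 'model' that ignores the prefix and scores every token equally."""
    return [0.0] * VOCAB_SIZE


# A text segment, then an image segment, then more text, as one sequence.
segments = [("text", [0, 2]), ("image", [1, 3, 0]), ("text", [1])]
sequence = interleave(segments)
loss = next_token_loss(sequence, uniform_logits)
```

With the uniform stand-in, the loss equals log(8), since each of the eight toy tokens is treated as equally likely regardless of modality. A trained model would take the place of `uniform_logits`; the loss computation itself would not change between text and image positions.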

