Fugu-MT Paper Translation (Abstract): Lance: Unified Multimodal Modeling by Multi-Task Synergy

Paper abstract: Lance: Unified Multimodal Modeling by Multi-Task Synergy

arxiv url: http://arxiv.org/abs/2605.18678v2
Date: Wed, 20 May 2026 11:14:23 GMT
Status: Translation complete
Last updated in system: 2026-05-21 14:55:44.3229
Title: Lance: Unified Multimodal Modeling by Multi-Task Synergy
Authors: Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang
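Abstract summary: Lance is a lightweight native unified model that supports multimodal understanding, generation, and editing for both images and videos. It is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences. Experiments show that Lance substantially outperforms existing open-source unified models in image and video generation.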
Reference score (independently calculated attention score): 50.81778765489668
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

Related papers