Fugu-MT Paper Translation (Abstract): Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Paper overview: Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

arxiv url: http://arxiv.org/abs/2602.20161v1
Date: Mon, 23 Feb 2026 18:59:58 GMT
Status: Translation complete
Last updated in system: 2026-02-24 17:42:02.973117
Title: Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
Title (reference translation): Mobile-O: Unified Multimodal Understanding and Generation on Mobile Devices
Authors: Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan
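Abstract summary: We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. Running in only about 3 seconds per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices.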
Reference score (independently computed attention score): 90.46496321553843
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
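Abstract (reference translation): Unified multimodal models can understand and generate visual content within a single architecture. Existing models, however, still require large amounts of data and are too heavy to deploy on edge devices. We propose Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained in a new quadruplet format (generation prompt, image, question, answer), Mobile-O jointly improves both visual understanding and generation. Despite its efficiency, Mobile-O achieves competitive or superior performance compared with other unified models: it reaches 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% on average across seven benchmarks. Running in only about 3 seconds per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research on real-time unified multimodal intelligence that runs entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/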
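
The abstract credits depthwise-separable convolutions for the MCP's low computational cost. As a rough illustration of why that building block is cheap, the sketch below compares weight counts for a standard convolution and a depthwise-separable one. It is not the authors' MCP implementation: the function names and the layer sizes (256 channels, 3x3 kernel) are hypothetical and chosen only for illustration, and biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, not taken from the paper.
c_in, c_out, k = 256, 256, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, depthwise-separable: {sep}, ratio: {std / sep:.2f}")
```

For these assumed sizes the separable variant needs roughly one ninth of the weights, which is the usual motivation for using it on mobile hardware.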

Related papers