Fugu-MT paper translation (abstract): MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Paper summary: MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

arxiv url: http://arxiv.org/abs/2604.27393v1
Date: Thu, 30 Apr 2026 04:05:43 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:53.918094
Title: MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Authors: Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, Yuan Yao
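Abstract summary: MiniCPM-o 4.5 is the authors' latest effort towards human-like, real-time multimodal interaction. Its key technique, Omni-Flow, is a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale.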
Reference score (independently computed attention score): 76.4461698685681
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-level multimodal interaction. The key bottlenecks are no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still separated into alternating phases, preventing models from incorporating new inputs for timely adjustment during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in the evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which mitigates these gaps by real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real-time, while also exhibiting proactive behaviors such as issuing reminders or comments based on its continuous understanding of the live scene. The key technique behind MiniCPM-o 4.5 is Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis. This formulation converts conventional turn-based interaction into a full-duplex, time-aligned process, enabling simultaneous perception and response and allowing proactive behavior to arise within the same framework. With a total of 9B parameters, MiniCPM-o 4.5 approaches Gemini 2.5 Flash in vision-language capabilities, delivering state-of-the-art open-source performance at its scale. It also surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and delivers better speech generation, with significantly higher computation efficiency. Driven by its efficient architecture design and inference optimization, the model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB RAM cost.

