Fugu-MT Paper Translation (Abstract): OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Paper overview: OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

arxiv url: http://arxiv.org/abs/2510.15870v2
Date: Mon, 27 Oct 2025 19:12:55 GMT
Status: Translation complete
System update date: 2025-10-29 17:50:20.156093
Title: OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Title (reference translation): OmniVinci: Architecture and Data Enhancement for Omni-Modal Understanding LLM
Authors: Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiyi, Yu-Chiang Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
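Abstract summary: We introduce OmniVinci, an initiative to build a strong, open-source omni-modal LLM. For model architecture, we present three key innovations: (i) OmniAlignNet, which strengthens alignment between vision and audio embeddings; (ii) Temporal Embedding Grouping, which captures temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding, which encodes absolute temporal information in omni-modal embeddings.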
Reference score (independently computed attention score): 146.029449832893
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
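Abstract (reference translation): Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source omni-modal LLM. We carefully study the design choices in model architecture and data curation. The model architecture has three key innovations: (i) OmniAlignNet, which strengthens alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping, which captures relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding, which encodes absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24 million single-modal and omni-modal conversations. Modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni by +19.05 on DailyOmni, +1.7 on MMAR, and +3.9 on Video-MME. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.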

Related papers