Fugu-MT Paper Translation (Abstract): OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination

Paper overview: OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination

arxiv url: http://arxiv.org/abs/2509.00723v1
Date: Sun, 31 Aug 2025 07:19:32 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.361197
Title: OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination
Title (reference translation): OmniDPO: A Preference Optimization Framework for Addressing Omni-Modal Hallucination
Authors: Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu
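Abstract summary: We propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in omni-modal large language models (OLLMs). By tackling both the dominance of text priors and the neglected correlation between video and its audio, OmniDPO improves multimodal grounding and reduces hallucination.
Reference score (independently computed attention score): 32.43796002503023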
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Omni-modal large language models (OLLMs) have sparked a new wave of research, achieving impressive results in tasks such as audio-video understanding and real-time environment perception. However, hallucination issues still persist. Similar to the bimodal setting, the priors from the text modality tend to dominate, leading OLLMs to rely more heavily on textual cues while neglecting visual and audio information. In addition, fully multimodal scenarios introduce new challenges. Most existing models align visual or auditory modalities with text independently during training, while ignoring the intrinsic correlations between video and its corresponding audio. This oversight results in hallucinations when reasoning requires interpreting hidden audio cues embedded in video content. To address these challenges, we propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in OLLMs. Specifically, OmniDPO incorporates two strategies: (1) constructing text-preference sample pairs to enhance the model's understanding of audio-video interactions; and (2) constructing multimodal-preference sample pairs to strengthen the model's attention to visual and auditory information. By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination. Experiments conducted on two OLLMs demonstrate that OmniDPO not only effectively mitigates multimodal hallucinations but also significantly enhances the models' reasoning capabilities across modalities. All code and datasets will be released upon paper acceptance.
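Abstract (reference translation): In recent years, omni-modal large language models (OLLMs) have set off a new wave of research, achieving notable results in tasks such as audio-video understanding and real-time environment perception. However, hallucination problems persist. As in the bimodal setting, priors from the text modality tend to dominate, so OLLMs rely heavily on textual cues while neglecting visual and audio information. Fully multimodal scenarios also bring new challenges. Most existing models align the visual and auditory modalities with text independently during training, ignoring the intrinsic correlation between a video and its corresponding audio. This oversight causes hallucinations when reasoning requires interpreting hidden audio cues embedded in video content. To address these challenges, we propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in OLLMs. Specifically, OmniDPO incorporates two strategies: (1) constructing text-preference sample pairs to improve the model's understanding of audio-video interactions, and (2) constructing multimodal-preference sample pairs to strengthen the model's attention to visual and auditory information. By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination. Experiments on two OLLMs show that OmniDPO not only effectively mitigates multimodal hallucinations but also substantially improves the models' reasoning capabilities across modalities. All code and datasets will be released upon acceptance of the paper.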
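
The abstract names two kinds of preference pairs but does not give the training objective. As a rough illustration only, the sketch below applies the standard DPO loss to both kinds of pair. It is not the authors' implementation: the function names, the beta value, the log-probabilities, and in particular the reading of a multimodal-preference pair as "the same answer scored against the original versus a corrupted audio-video input" are all assumptions made for this example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is a summed log-probability of a response under the
    trainable policy or the frozen reference model. beta=0.1 is a
    common default, not a value taken from the paper.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


# Hypothetical log-probabilities, invented for illustration only.

# Text-preference pair: same audio-video input, grounded answer (chosen)
# versus hallucinated answer (rejected).
text_pair_loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)

# Multimodal-preference pair (ASSUMPTION, not confirmed by the abstract):
# same answer, scored given the original input (chosen) versus a
# corrupted audio-video input (rejected).
mm_pair_loss = dpo_loss(-12.0, -18.0, -13.0, -16.0)

# Unweighted sum; any weighting between the two terms is also an assumption.
total_loss = text_pair_loss + mm_pair_loss
print(text_pair_loss, mm_pair_loss, total_loss)
```

In this sketch both pair types share one loss function and differ only in what is held fixed: the input for text-preference pairs, the response for multimodal-preference pairs. The loss equals log 2 when the policy matches the reference and shrinks as the policy's margin for the chosen side grows.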

Related papers