Fugu-MT Paper Translation (Abstract): Online Self-Calibration Against Hallucination in Vision-Language Models

Paper overview: Online Self-Calibration Against Hallucination in Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.00323v1
Date: Fri, 01 May 2026 01:03:05 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.805212
Title: Online Self-Calibration Against Hallucination in Vision-Language Models
Title (reference translation): Online Self-Calibration Against Hallucination in Vision-Language Models
Authors: Minghui Chen, Chenxu Yang, Hengjie Zhu, Dayan Wu, Zheng Lin, Qingyi Si,
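Abstract summary: Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. The paper proposes Online Self-CAlibRation (OSCAR).
Reference score (independently computed attention measure): 23.13137973421435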
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.
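Abstract (reference translation): Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details missing from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments show that OSCAR achieves state-of-the-art performance on hallucination benchmarks and improves general multimodal capabilities.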
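
The abstract names Direct Preference Optimization as the refinement step but does not give its form. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair from sequence-level log-probabilities. The function name `dpo_loss`, the default `beta=0.1`, and the example log-probability values are assumptions for illustration; this is not the authors' implementation, and it does not cover how OSCAR's tree search or reward mechanism produce the pairs.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All arguments are summed log-probabilities of a full response under the
    policy being trained (logp_*) or the frozen reference model (ref_logp_*).
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Loss is -log(sigmoid(margin)), evaluated in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already favors the chosen response.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Same pair with the preference reversed: the loss should be larger.
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, which is the pressure a DPO-based framework applies on each refinement round.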

Related papers