Fugu-MT Paper Translation (Abstract): Self-Rewarding Vision-Language Model via Reasoning Decomposition

Paper overview: Self-Rewarding Vision-Language Model via Reasoning Decomposition

arxiv url: http://arxiv.org/abs/2508.19652v1
Date: Wed, 27 Aug 2025 08:01:03 GMT
Status: Translation complete
Last updated in system: 2025-08-28 19:07:41.550896
Title: Self-Rewarding Vision-Language Model via Reasoning Decomposition
Title (reference translation): Self-Rewarding Vision-Language Model via Reasoning Decomposition
Authors: Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu
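Abstract summary: Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and from language shortcuts. This paper introduces Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision. The experiments show that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts.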
Reference score (independently computed attention score): 49.784411666601905
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning via reinforcement learning without relying on external visual supervision. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input, and the result is used to compute the reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
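Abstract (reference translation): Vision-Language Models (VLMs) often suffer from visual hallucinations, in which they describe things that are not actually in the image, and from language shortcuts, in which they skip the visual part and rely only on text priors. These problems arise because most VLM post-training methods rely on simple verifiable answer matching and supervise only the final output, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or labels distilled from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. This paper introduces Vision-SR1, a self-rewarding method. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce a self-contained visual perception that is sufficient to answer the question without referring to the input image. To verify this self-containment, the same VLM is re-prompted to perform language reasoning with only the generated perception as input, and the reward is computed from the result. This self-reward is combined with supervision on the final output, providing a balanced training signal that strengthens both visual perception and language reasoning. The experiments show that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across a variety of vision-language tasks.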
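
The two-stage reward described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `generate` callable, the `PERCEPTION:` / `ANSWER:` output format, the exact-match scoring, and the additive combination with `perception_weight` are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical model interface (an assumption, not from the paper):
# (prompt, image or None) -> generated text.
Generate = Callable[[str, Optional[object]], str]


@dataclass
class Rollout:
    perception: str
    answer: str


def split_rollout(text: str) -> Rollout:
    """Split a response of the assumed form 'PERCEPTION: ... ANSWER: ...'."""
    head, _, tail = text.partition("ANSWER:")
    perception = head.replace("PERCEPTION:", "", 1).strip()
    return Rollout(perception=perception, answer=tail.strip())


def matches(predicted: str, gold: str) -> bool:
    """Assumed verifier: case-insensitive exact match on the final answer."""
    return predicted.strip().lower() == gold.strip().lower()


def two_stage_reward(
    generate: Generate,
    question: str,
    image: object,
    gold: str,
    perception_weight: float = 0.5,  # assumed weighting, not from the paper
) -> float:
    # Stage 1: the model sees the image and writes a perception plus an answer.
    first = split_rollout(
        generate(f"Describe the image, then answer.\nQuestion: {question}", image)
    )
    answer_reward = 1.0 if matches(first.answer, gold) else 0.0

    # Stage 2: the same model is re-prompted with the perception text only
    # (no image). A correct answer suggests the perception is self-contained.
    second = split_rollout(
        generate(f"Description: {first.perception}\nQuestion: {question}", None)
    )
    perception_reward = 1.0 if matches(second.answer, gold) else 0.0

    # Assumed combination: final-answer supervision plus weighted self-reward.
    return answer_reward + perception_weight * perception_reward
```

Under these assumptions, a rollout that reaches the correct final answer with an uninformative perception earns only the answer reward, which is the language-shortcut behavior the self-reward is meant to discourage.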

