Fugu-MT Paper Translation (Summary): ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Paper summary: ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

arxiv url: http://arxiv.org/abs/2605.20278v2
Date: Sun, 24 May 2026 12:22:09 GMT
Status: Translation complete
Last updated in system: 2026-05-26 16:32:37.765289
Title: ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
Title (reference translation): ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
Authors: Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng
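Abstract summary: ClaimDiff-RL is a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. It improves the balance between hallucinations and missing facts, preserves general capability, and surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions.
Reference score (independently computed attention score): 38.42736245144838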
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness and coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination--missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
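Abstract (reference translation): Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, whereas the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. However, pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, which obscures the tradeoff between factuality and coverage. This paper introduces ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. As a result, hallucinated claims and omitted salient facts become separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this tradeoff between faithfulness and coverage and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.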
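
The reward composition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the reward formula, so the `ClaimDiff` record, the three-level `SEVERITY_WEIGHT` scale, the meaning of the `verified` flag, and the `w_hallucination` / `w_missing` weights are all hypothetical names and values chosen for illustration. The sketch only shows how per-difference statistics could keep hallucinations and missing facts as separate, individually weighted penalty terms.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical severity scale; the abstract does not specify the real one.
SEVERITY_WEIGHT: Dict[str, float] = {"minor": 0.25, "moderate": 0.5, "severe": 1.0}


@dataclass(frozen=True)
class ClaimDiff:
    """One judge-enumerated difference between actor and reference captions."""
    kind: str        # "hallucination" (unsupported actor claim) or "missing" (omitted salient fact)
    error_type: str  # open-vocabulary label, e.g. "object count"
    severity: str    # key into SEVERITY_WEIGHT
    verified: bool   # assumption: True if the judge confirmed the error against the image


def diff_stats(diffs: Iterable[ClaimDiff]) -> Dict[str, float]:
    """Sum severity weights per kind, counting only verified differences."""
    totals = {"hallucination": 0.0, "missing": 0.0}
    for d in diffs:
        if d.verified:
            totals[d.kind] += SEVERITY_WEIGHT[d.severity]
    return totals


def compose_reward(diffs: Iterable[ClaimDiff],
                   w_hallucination: float = 1.0,
                   w_missing: float = 1.0) -> float:
    """Return a negative penalty with a separately tunable weight for each kind."""
    stats = diff_stats(diffs)
    return -(w_hallucination * stats["hallucination"] + w_missing * stats["missing"])


example = [
    ClaimDiff("hallucination", "object count", "severe", True),
    ClaimDiff("missing", "spatial relation", "minor", True),
    ClaimDiff("hallucination", "color", "moderate", False),  # not verified, ignored
]
print(diff_stats(example))
print(compose_reward(example, w_missing=2.0))
```

Raising `w_missing` relative to `w_hallucination` in this sketch penalizes omissions more heavily. This is one simple way to move along the faithfulness and coverage tradeoff that a single holistic scalar reward would hide.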
