Fugu-MT Paper Translation (Abstract): VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Paper abstract: VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

arxiv url: http://arxiv.org/abs/2605.28023v1
Date: Wed, 27 May 2026 06:27:04 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.801169
Title: VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
Authors: Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan
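Abstract summary: This paper proposes VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). VCap provides a reward signal with hypergeometric-distribution-level precision for caption quality verification. In experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks.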
Reference score (independently computed attention metric): 57.588999592609646
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage; however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

Related papers