Fugu-MT Paper Translation (Abstract): Visual-ERM: Reward Modeling for Visual Equivalence

Paper overview: Visual-ERM: Reward Modeling for Visual Equivalence

arxiv url: http://arxiv.org/abs/2603.13224v1
Date: Fri, 13 Mar 2026 17:58:14 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.234496
Title: Visual-ERM: Reward Modeling for Visual Equivalence
Title (reference translation): Visual-ERM: Reward Modeling for Visual Equivalence
Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang
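Abstract summary: The Visual Equivalence Reward Model (Visual-ERM) is a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing. VisualCritic-RewardBench (VC-RewardBench) is a benchmark for judging fine-grained image-to-image discrepancies on structured visual data.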
Reference score (attention score computed independently by this site): 59.317480168347664
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
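Abstract (reference translation): Vision-to-code tasks require models that reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. Recent Large Vision Language Models (LVLMs) obtain strong results through supervised fine-tuning, but reinforcement learning remains difficult because the reward signals are misaligned. Existing rewards rely on textual rules or on coarse visual embedding similarity; neither captures fine-grained visual discrepancies, and both are vulnerable to reward hacking. We propose the Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average). VisualCritic-RewardBench (VC-RewardBench) is a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, on which Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. These results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.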
