Fugu-MT Paper Translation (Abstract): HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Paper abstract: HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

arxiv url: http://arxiv.org/abs/2603.17024v1
Date: Tue, 17 Mar 2026 18:04:58 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.338646
Title: HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
Title (reference translation): HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
Authors: Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
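Abstract summary: Long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors. Most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence. We propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training.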
Reference score (independently computed attention score): 86.82637240330791
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
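Abstract (reference translation): VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. Long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence, so these weaknesses remain largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, in which earlier hops establish the instances, sets, or conditions needed by later hops. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare results on 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, it improves 20 of the 24 benchmarks on both models, indicating broad and generalizable gains. To show that fully chained queries matter, we replace them with half-multi-hop or single-hop variants, which lowers the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.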
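
The abstract describes each query as a chain of hops in which later hops depend on earlier ones, ending in a single number that can be checked automatically. The following is a minimal illustrative sketch of that idea, not the authors' implementation. The class names (`Hop`, `MultiHopQuery`), the `depends_on` field, the dependency check, and the "last number in the response" reward rule are all assumptions made for illustration; the abstract does not specify a data schema or a reward function.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Hop:
    """One step of a chained query (hypothetical schema, not from the paper)."""
    question: str
    # Indices of earlier hops whose results this hop needs (assumed field).
    depends_on: list[int] = field(default_factory=list)


@dataclass
class MultiHopQuery:
    """A chained query over one image with a single numeric final answer."""
    image_id: str
    hops: list[Hop]
    answer: int


def is_valid_chain(query: MultiHopQuery) -> bool:
    """Check that every hop after the first depends only on earlier hops,
    and on at least one of them (an assumed reading of 'logically dependent')."""
    for i, hop in enumerate(query.hops):
        if any(d < 0 or d >= i for d in hop.depends_on):
            return False
        if i > 0 and not hop.depends_on:
            return False
    return True


def numeric_reward(response: str, answer: int) -> float:
    """Return 1.0 if the last number in the response equals the answer,
    else 0.0. This exact-match rule is an assumption for illustration."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(answer) else 0.0


# A made-up example query; the image ID and questions are hypothetical.
example = MultiHopQuery(
    image_id="img_0001",
    hops=[
        Hop("Which person is holding an umbrella?"),
        Hop("Which table is closest to that person?", depends_on=[0]),
        Hop("How many cups are on that table?", depends_on=[1]),
    ],
    answer=3,
)
```

A numeric final answer keeps the reward check simple: no judge model is needed, only a comparison of the extracted number against the stored answer.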

Related papers