Fugu-MT Paper Translation (Abstract): VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

Paper overview: VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

arxiv url: http://arxiv.org/abs/2604.21396v1
Date: Thu, 23 Apr 2026 08:04:07 GMT
Status: Translation complete
System update date: 2026-04-24 14:40:06.377848
Title: VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
Title (reference translation): VG-CoT: Toward Trustworthy Visual Reasoning via Grounded Chain-of-Thought
Authors: Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun, YoungBin Kim
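Abstract summary: The authors propose the Visual Grounding Chain-of-Thought dataset, which explicitly links each reasoning step to actual visual evidence within the image. The pipeline generates step-by-step grounded reasoning with GPT-4o and refines the grounding through a rationale-driven open-set detection process. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, show consistent improvements on most evaluation metrics.
Reference score (attention score computed independently by this site): 16.361394107862502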
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
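Abstract (reference translation): Advancing Large Vision-Language Models (LVLMs) requires precise, local-region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face scalability limits because they depend on extensive manual annotation and lack explicit alignment between multi-step reasoning and the corresponding image regions, which restricts the evaluation of model trustworthiness. To address these challenges, the authors propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to actual visual evidence in the image through a fully automated three-stage pipeline. The pipeline first extracts object-level and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. The authors also introduce a new benchmark that comprehensively evaluates LVLM reasoning along three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs such as LLaVA-1.5 and Qwen2-VL show consistent improvements on most evaluation metrics, confirming that VG-CoT effectively strengthens trustworthy, evidence-based reasoning while keeping dataset construction scalable and cost-efficient. The dataset and code will be released publicly upon acceptance to facilitate further research.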
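
The three-stage pipeline described in the abstract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the record layout (`Evidence`, `GroundedStep`, `VGCoTRecord`), the box format, the `build_record` helper, and the `grounded_fraction` check are hypothetical names, not the authors' code or schema, and the three stage functions are passed in as placeholders rather than calling any real detector, OCR model, or GPT-4o.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Box format (x1, y1, x2, y2) is an assumption, not taken from the paper.
Box = Tuple[float, float, float, float]


@dataclass
class Evidence:
    label: str  # object class or OCR text
    box: Box    # image region the label refers to


@dataclass
class GroundedStep:
    text: str                                        # one reasoning step
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class VGCoTRecord:
    question: str
    steps: List[GroundedStep]
    answer: str


def build_record(
    image_id: str,
    question: str,
    extract: Callable[[str], List[Evidence]],
    reason: Callable[[str, List[Evidence]], Tuple[List[GroundedStep], str]],
    refine: Callable[[str, GroundedStep], GroundedStep],
) -> VGCoTRecord:
    """Chain the three stages; each callable is a hypothetical stand-in."""
    evidence = extract(image_id)                   # stage 1: detection and OCR
    steps, answer = reason(question, evidence)     # stage 2: grounded reasoning
    steps = [refine(image_id, s) for s in steps]   # stage 3: grounding refinement
    return VGCoTRecord(question, steps, answer)


def grounded_fraction(record: VGCoTRecord) -> float:
    """Share of steps citing at least one region (illustrative, not a paper metric)."""
    if not record.steps:
        return 0.0
    return sum(1 for s in record.steps if s.evidence) / len(record.steps)
```

Passing the stages in as callables keeps the sketch independent of any particular detector or language model; a real implementation would replace them with the models named in the abstract.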


Related papers