Fugu-MT Paper Translation (Abstract): AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation

Paper summary: AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation

arxiv url: http://arxiv.org/abs/2604.17488v1
Date: Sun, 19 Apr 2026 15:22:00 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.551312
Title: AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation
Title (reference translation): AutoVQA-G: A Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation
Authors: Rongsheng Hu, Runwei Guan, Yicheng Di, Jiayu Bao, Yuan Liu
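Abstract summary: Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets is essential for advancing vision-language models. Existing automated methods are often hindered by two key issues. This paper introduces AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation.
Reference score (independently computed attention score): 7.909534692798243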
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G
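Abstract (reference translation): Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is essential for advancing vision-language models (VLMs), but it does not scale. Existing automated methods are often hindered by two main problems: (1) inconsistent data fidelity caused by model hallucinations, and (2) brittle verification mechanisms based on simple heuristics. To address these limitations, the paper introduces AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G uses an iterative refinement loop in which a Consistency Evaluation module applies Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples and progressively refines the generation prompts. Experiments show that AutoVQA-G generates VQA-G datasets with better visual grounding accuracy than leading multimodal LLMs, offering a promising approach to creating high-fidelity data for more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G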
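
Illustrative sketch (not the authors' code): the abstract describes a loop in which generated samples are verified, critiques from failed samples are collected, and a memory-augmented agent refines the generation prompt. The Python below is a minimal sketch of that control flow only. Every name in it (`Sample`, `PromptOptimizer`, `refinement_loop`, `toy_generate`, `toy_evaluate`) is hypothetical, and the generator and evaluator are toy stand-ins for the model calls and the CoT-based Consistency Evaluation module, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Sample:
    question: str
    answer: str
    box: Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) grounding box


@dataclass
class PromptOptimizer:
    """Hypothetical stand-in for the memory-augmented Prompt Optimization agent."""
    memory: List[str] = field(default_factory=list)

    def refine(self, prompt: str, critiques: List[str]) -> str:
        # Keep only critiques not already in memory, then fold them into the prompt.
        new = []
        for critique in critiques:
            if critique not in self.memory and critique not in new:
                new.append(critique)
        self.memory.extend(new)
        if not new:
            return prompt
        return prompt + "\n" + "\n".join(f"Avoid: {c}" for c in new)


def refinement_loop(
    generate: Callable[[str], List[Sample]],
    evaluate: Callable[[Sample], Tuple[bool, str]],
    prompt: str,
    optimizer: PromptOptimizer,
    max_rounds: int = 3,
) -> Tuple[str, List[Sample]]:
    """Regenerate until no sample fails verification or max_rounds is reached."""
    accepted: List[Sample] = []
    for _ in range(max_rounds):
        accepted, critiques = [], []
        for sample in generate(prompt):
            ok, critique = evaluate(sample)
            if ok:
                accepted.append(sample)
            else:
                critiques.append(critique)
        if not critiques:
            break  # every sample in this round passed verification
        prompt = optimizer.refine(prompt, critiques)
    return prompt, accepted


def toy_generate(prompt: str) -> List[Sample]:
    # Toy generator: emits a degenerate box until the prompt carries a critique.
    box = (10, 10, 50, 50) if "Avoid:" in prompt else (0, 0, 0, 0)
    return [Sample("What is on the table?", "a cup", box)]


def toy_evaluate(sample: Sample) -> Tuple[bool, str]:
    # Toy check: accept only boxes with positive width and height.
    x1, y1, x2, y2 = sample.box
    if x2 > x1 and y2 > y1:
        return True, ""
    return False, "empty grounding box"


if __name__ == "__main__":
    final_prompt, kept = refinement_loop(
        toy_generate, toy_evaluate, "Generate a VQA-G sample.", PromptOptimizer()
    )
    print(final_prompt)
    print(kept)
```

In this sketch the memory is just a list of critiques already seen, so the prompt is not extended twice for the same failure. The framework described in the abstract presumably uses richer memory and model-based evaluation.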

Related papers