Fugu-MT Paper Translation (Abstract): Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

Paper overview: Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

arxiv url: http://arxiv.org/abs/2604.07146v1
Date: Wed, 08 Apr 2026 14:37:35 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.585414
Title: Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering
Authors: Zhuohong Chen, Zhenxian Wu, Yunyao Yu, Hangrui Xu, Zirui Liao, Zhifang Liu, Xiangwen Deng, Pen Jiao, Haoqian Wang
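Abstract summary: Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. We reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure.
Reference score (independently computed attention score): 18.5913106358874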
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, or Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
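The multi-step decision procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: only the four action names come from the abstract. The `State` fields, the tool stubs, `toy_policy`, and `run_agent` are hypothetical names introduced here, and the fixed rule in `toy_policy` merely stands in for the fine-tuned model that would choose actions in the real system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Action(Enum):
    """The four actions named in the abstract."""
    ANSWER = "answer"
    IMAGE_RETRIEVAL = "image_retrieval"
    TEXT_RETRIEVAL = "text_retrieval"
    CAPTION = "caption"


@dataclass
class State:
    """Hypothetical information state: the inputs plus everything gathered so far."""
    question: str
    image_id: str
    evidence: List[str] = field(default_factory=list)
    trajectory: List[Action] = field(default_factory=list)


# Hypothetical tool stubs. A real system would call retrievers and a captioner.
def image_retrieval(state: State) -> str:
    return f"entity page matched to {state.image_id}"


def text_retrieval(state: State) -> str:
    return f"passage retrieved for: {state.question}"


def caption(state: State) -> str:
    return f"caption of {state.image_id}"


TOOLS: Dict[Action, Callable[[State], str]] = {
    Action.IMAGE_RETRIEVAL: image_retrieval,
    Action.TEXT_RETRIEVAL: text_retrieval,
    Action.CAPTION: caption,
}


def toy_policy(state: State) -> Action:
    """Assumed stand-in for the learned policy: use each tool once, then answer."""
    used = set(state.trajectory)
    for action in (Action.CAPTION, Action.IMAGE_RETRIEVAL, Action.TEXT_RETRIEVAL):
        if action not in used:
            return action
    return Action.ANSWER


def run_agent(state: State,
              policy: Callable[[State], Action] = toy_policy,
              max_steps: int = 6) -> str:
    """Pick one action per step until the policy answers or the step budget runs out."""
    for _ in range(max_steps):
        action = policy(state)
        state.trajectory.append(action)  # recorded decisions form the trajectory
        if action is Action.ANSWER:
            break
        state.evidence.append(TOOLS[action](state))
    # Placeholder answer; when the budget runs out, an answer is still returned.
    return f"answer based on {len(state.evidence)} evidence item(s)"
```

Calling `run_agent(State(question="Who designed this bridge?", image_id="img_001"))` walks through the three tool actions and then stops at `Action.ANSWER`. The recorded `trajectory` list is a much simplified analogue of the multi-step trajectories that the abstract says are collected as supervision for fine-tuning.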

Related papers