Fugu-MT Paper Translation (Abstract): AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

Paper overview: AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions

arxiv url: http://arxiv.org/abs/2603.07394v1
Date: Sun, 08 Mar 2026 00:48:19 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:14.400788
Title: AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Title (reference translation): AQuA: Toward Strategic Response Generation for Ambiguous Visual Questions
Authors: Jihyoung Jang, Hyounghun Kim
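Abstract summary: This paper introduces Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies.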
Reference score (independently computed attention score): 5.891896951832169
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks primarily feature clear and unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that require nuanced reasoning and context-appropriate response strategies. Although recent studies have begun to address ambiguity in VQA, they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. In this paper, we introduce Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of ambiguity, along with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type, frequently producing overconfident answers rather than seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA, enabling them to adaptively choose among multiple response strategies, such as directly answering, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.
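Abstract (reference translation): Visual Question Answering (VQA) is a core task for evaluating the capabilities of Vision-Language Models (VLMs). Existing VQA benchmarks mainly feature clear, unambiguous image-question pairs, whereas real-world scenarios often involve varying degrees of ambiguity that call for nuanced reasoning and context-appropriate response strategies. Recent studies have begun to address ambiguity in VQA, but they lack (1) a systematic categorization of ambiguity levels and (2) datasets and models that support strategy-aware responses. This paper introduces Ambiguous Visual Question Answering (AQuA), a fine-grained dataset that classifies ambiguous VQA instances into four levels according to the nature and degree of their ambiguity, together with the optimal response strategy for each case. Our evaluation of diverse open-source and proprietary VLMs shows that most models fail to adapt their strategy to the ambiguity type and frequently produce overconfident answers instead of seeking clarification or acknowledging uncertainty. To address this challenge, we fine-tune VLMs on AQuA so that they can adaptively choose among multiple response strategies, such as answering directly, inferring intent from contextual cues, listing plausible alternatives, or requesting clarification. VLMs trained on AQuA achieve strategic response generation for ambiguous VQA, demonstrating the ability to recognize ambiguity, manage uncertainty, and respond with context-appropriate strategies, while outperforming both open-source and closed-source baselines.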

Related papers