Fugu-MT Paper Translation (Abstract): MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering

Paper overview: MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering

arxiv url: http://arxiv.org/abs/2511.12142v1
Date: Sat, 15 Nov 2025 10:14:59 GMT
Status: Translation complete
System update date: 2025-11-18 14:36:23.636559
Title: MAVIS: A Benchmark for Multimodal Source Attribution in Long-form Visual Question Answering
Authors: Seokwon Song, Minsu Park, Gunhee Kim
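Abstract summary: We introduce MAVIS, the first benchmark for evaluating multimodal source attribution systems. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions (informativeness, groundedness, and fluency) and demonstrate their strong correlation with human judgments.
Reference score (attention score computed independently by the site): 44.41273615523289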
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Source attribution aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided answers. However, existing work has primarily focused on text-only scenarios and largely overlooked the role of multimodality. We introduce MAVIS, the first benchmark designed to evaluate multimodal source attribution systems that understand user intent behind visual questions, retrieve multimodal evidence, and generate long-form answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with fact-level citations referring to multimodal documents. We develop fine-grained automatic metrics along three dimensions of informativeness, groundedness, and fluency, and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs with multimodal RAG generate more informative and fluent answers than unimodal RAG, but they exhibit weaker groundedness for image documents than for text documents, a gap amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights mitigating contextual bias in interpreting image documents as a crucial direction for future research. The dataset and experimental code are available at https://github.com/seokwon99/MAVIS


Related papers