Fugu-MT Paper Translation (Abstract): Multimodal QUD: Inquisitive Questions from Scientific Figures

Paper overview: Multimodal QUD: Inquisitive Questions from Scientific Figures

arxiv url: http://arxiv.org/abs/2604.23733v1
Date: Sun, 26 Apr 2026 14:25:23 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.527728
Title: Multimodal QUD: Inquisitive Questions from Scientific Figures
Authors: Yating Wu, William Rudman, Venkata S Govindarajan, Alexandros G. Dimakis, Junyi Jessy Li
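Abstract summary: We generate inquisitive questions that reach the depth of the questions humans ask when engaging with scientific papers. We extend the linguistic theory of Questions Under Discussion (QUD) from text-only to multimodal. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding.
Reference score (independently computed attention score): 63.41049609329304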
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Asking inquisitive questions while reading, and looking for their answers, is an important part of human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate the capabilities of Vision-Language Models (VLMs), current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers; these questions are conditioned on both the figure and the paper's context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD), in which implicit questions are raised and resolved as discourse progresses, from text-only to multimodal. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.


Related papers