2026-04-26 Daily Report: Multimodal QUD: Inquisitive Questions from Scientific Figures

Multimodal QUD: Inquisitive Questions from Scientific Figures

Authors Yating Wu, William Rudman, Venkata S Govindarajan, Alexandros G. Dimakis, Junyi Jessy Li

Affiliations BespokeLabs.ai / University of California, Berkeley / The University of Texas at Austin / Ithaca College

Categories Task / Question Generation / Generating inquisitive questions from multimodal scientific data; Method / Vision-Language Models / Fine-tuning VLMs for question generation; Application / Scientific Document Analysis / Understanding scientific figures through QUD

License CC BY 4.0

Abstract Overview

This paper extends the linguistic Question Under Discussion (QUD) framework from text-only discourse to multimodal scientific discourse, where figures and surrounding paper context jointly trigger implicit questions. The authors introduce MQUD, a dataset of 1,250 multimodal QUDs from 245 figures across 56 papers in NLP, machine learning, and astronomy, with annotations from 17 original paper authors across seven dimensions. They propose two reusable diagnostics—relative information gain (rIG) and within-paper figure swap—to test whether vision-language models genuinely ground in figure content rather than responding to generic visual input. Fine-tuning a VLM on MQUD shifts question generation from generic low-level visual questions toward content-specific, visually grounded scientific questions.

Novelty

The main novelty is extending QUD theory to multimodal scientific discourse, treating figures as discourse participants that raise implicit questions not triggered by text alone. The paper also introduces MQUD as the first dataset targeting figure–text interaction with verified figure specificity and original-author salience judgments, along with two grounding diagnostics: relative information gain and within-paper figure swap tests.

Results

Fine-tuning Qwen3.5-9B on MQUD increased relative information gain from 0.60 to 0.97 and shifted figure-swap behavior from a generic visual-input bias (12% swap-positive) to content-specific grounding (75% swap-positive), with 82% swap-positive on a paper-disjoint evaluation set. GPT-4o showed figure sensitivity (rIG 0.72) but weak content-specific grounding (18% swap-positive). In LLM-judge evaluation, the fine-tuned model was preferred over the base model on depth (75%), figure specificity (64%), and question diversity (78%).

Key Points

The paper formalizes multimodal QUDs as questions triggered jointly by scientific figures and paper context, distinguishing figure-driven questions (comparison, extent) from integration questions (cause, consequence, procedural, concept) that require cross-modal reasoning.
MQUD contains 1,250 validated questions from 245 figures across 56 papers, annotated along seven dimensions that include salience, figure usefulness, and answer correctness; 703 of the QUDs are annotated by 17 authors of the original papers, who serve as domain experts.
The proposed diagnostics (rIG and figure swap) demonstrate that supervised fine-tuning shifts a VLM from generating generic visual questions toward content-specific, visually grounded scientific questions, a capability that prompting alone does not achieve, even with GPT-4o.
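
To make the "swap-positive" percentages above more concrete, here is a minimal Python sketch of how such a rate could be tallied. This summary does not give the paper's exact scoring rule, so everything here is an assumption for illustration: the function name swap_positive_rate, the margin parameter, and the idea that each question comes with one model score (for example a log-likelihood) under its original figure and one under a different figure swapped in from the same paper. A question is counted as swap-positive when the original figure scores higher than the swapped one.

```python
from typing import Iterable, Tuple


def swap_positive_rate(
    scores: Iterable[Tuple[float, float]], margin: float = 0.0
) -> float:
    """Fraction of questions that score higher with their original figure.

    Hypothetical helper, not the paper's implementation. Each item is an
    (original_score, swapped_score) pair, where higher means the model
    finds the question more likely given that figure. A pair counts as
    swap-positive when original_score - swapped_score exceeds `margin`.
    """
    pairs = list(scores)
    if not pairs:
        raise ValueError("need at least one (original, swapped) score pair")
    positives = sum(1 for original, swapped in pairs if original - swapped > margin)
    return positives / len(pairs)


# Made-up scores for four questions, purely for illustration.
example = [(-1.0, -2.0), (-1.0, -0.5), (-3.0, -3.0), (-0.2, -0.9)]
rate = swap_positive_rate(example)
print(f"swap-positive rate: {rate:.2f}")
```

Under this reading, a model that reacts only to the presence of some image, not to its content, would score the original and swapped figures about equally and produce a low rate, which matches the generic visual-input bias described in the Results.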

References

arXiv: https://arxiv.org/abs/2604.23733v1
Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.23733v1
Project: http://lingchensanwen.github.io/multimodal-qud/
