Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA
- URL: http://arxiv.org/abs/2509.24445v1
- Date: Mon, 29 Sep 2025 08:28:44 GMT
- Title: Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA
- Authors: Jianxin Liang, Tan Yue, Yuxuan Wang, Yueqian Wang, Zhihan Yin, Huishuai Zhang, Dongyan Zhao
- Abstract summary: We introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP) and Question-Based Captioning (QBC).
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The performance of Video Question Answering (VideoQA) models is fundamentally constrained by the nature of their supervision, which typically consists of isolated, factual question-answer pairs. This "bag-of-facts" approach fails to capture the underlying narrative and causal structure of events, limiting models to a shallow understanding of video content. To move beyond this paradigm, we introduce a framework to synthesize richer supervisory signals. We propose two complementary strategies: Question-Based Paraphrasing (QBP), which synthesizes the diverse inquiries (what, how, why) from a video's existing set of question-answer pairs into a holistic narrative paragraph that reconstructs the video's event structure; and Question-Based Captioning (QBC), which generates fine-grained visual rationales, grounding the answer to each question in specific, relevant evidence. Leveraging powerful generative models, we use this synthetic data to train VideoQA models under a unified next-token prediction objective. Extensive experiments on STAR and NExT-QA validate our approach, demonstrating significant accuracy gains and establishing new state-of-the-art results, such as improving a 3B model to 72.5% on STAR (+4.9%) and a 7B model to 80.8% on NExT-QA. Beyond accuracy, our analysis reveals that both QBP and QBC substantially enhance cross-dataset generalization, with QBP additionally accelerating model convergence by over 2.5x. These results demonstrate that shifting data synthesis from isolated facts to narrative coherence and grounded rationales yields a more accurate, efficient, and generalizable training paradigm.
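The two synthesis strategies described in the abstract can be pictured as two kinds of generator prompt: one built from all question-answer pairs of a video (QBP) and one built per question (QBC). The sketch below is only an illustration under stated assumptions. The abstract does not give the prompts, the generator, or the data format, so the `QAPair` class, both function names, and the prompt wording are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One question-answer annotation for a video (hypothetical container)."""
    question: str
    answer: str


def build_qbp_prompt(qa_pairs: List[QAPair]) -> str:
    """QBP-style prompt: merge every QA pair of one video into one narrative request.

    The wording is an assumption; the abstract only says the pairs are
    synthesized into a holistic narrative paragraph.
    """
    facts = "\n".join(
        f"{i}. Q: {pair.question} A: {pair.answer}"
        for i, pair in enumerate(qa_pairs, start=1)
    )
    return (
        "Rewrite the following question-answer pairs about one video as a "
        "single coherent paragraph describing what happens and why.\n" + facts
    )


def build_qbc_prompt(pair: QAPair) -> str:
    """QBC-style prompt: request a visual rationale for a single QA pair.

    The wording is an assumption; the abstract only says each answer is
    grounded in specific, relevant evidence.
    """
    return (
        "Describe the visual evidence in the video that supports the answer.\n"
        f"Question: {pair.question}\n"
        f"Answer: {pair.answer}"
    )


if __name__ == "__main__":
    pairs = [
        QAPair("What did the person pick up?", "A cup"),
        QAPair("Why did they pick it up?", "To drink water"),
    ]
    print(build_qbp_prompt(pairs))      # one prompt per video
    for pair in pairs:
        print(build_qbc_prompt(pair))   # one prompt per question
```

In this reading, QBP yields one synthetic text per video and QBC one per question; how those texts are mixed into the next-token prediction training data is not specified in the abstract and is left out of the sketch.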
Related papers
- VIPER: Process-aware Evaluation for Generative Video Reasoning [64.86465792516658]
We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking.
arXiv Detail & Related papers (2025-12-31T16:31:59Z)
- Q-Save: Towards Scoring and Attribution for Generated Video Evaluation [65.83319736145869]
We present Q-Save, a new benchmark dataset and model for holistic evaluation of AI-generated video (AIGV) quality. The dataset contains nearly 10,000 videos, each annotated with a scalar mean opinion score (MOS) and fine-grained attribution labels. We propose a unified evaluation model that jointly performs quality scoring and attribution-based explanation.
arXiv Detail & Related papers (2025-11-24T07:00:21Z)
- Foundational Question Generation for Video Question Answering via an Embedding-Integrated Approach [0.0]
We introduce Foundational Question Generation for Video Questioning via an Embedding-Integrated Approach (FIQ). FIQ is a framework designed to enhance the reasoning capability of VQA models by improving their foundational comprehension of video content. Experimental results on the SUTD-TrafficQA dataset demonstrate that FIQ achieves state-of-the-art performance, surpassing existing baseline approaches.
arXiv Detail & Related papers (2025-11-18T13:45:50Z)
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning [36.65883181090953]
ImplicitQA is a novel benchmark designed to test models on implicit reasoning. It comprises 1K meticulously annotated QA pairs derived from 320+ high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
- Causality Model for Semantic Understanding on Videos [0.0]
This thesis focuses on the domain of semantic video understanding. It explores the potential of causal modeling to advance two fundamental tasks: Video Relation Detection (VidVRD) and Video Question Answering (VideoQA).
arXiv Detail & Related papers (2025-03-16T10:44:11Z)
- Admitting Ignorance Helps the Video Question Answering Models to Answer [82.22149677979189]
We argue that models often establish shortcuts, resulting in spurious correlations between questions and answers. We propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness.
arXiv Detail & Related papers (2025-01-15T12:44:52Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong derivation in basic video understanding tasks. Video-LLMs struggle with compositional reasoning that requires multi-step explicit-temporal inference across object relations, interactions and events. We propose STEP, a novel graph-guided self-training method that enables VideoLLMs to generate reasoning-rich finetuning data from any raw videos to improve itself.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
- Causal Understanding For Video Question Answering [2.749898166276854]
Video Question Answering is a challenging task, which requires the model to reason over multiple frames and understand the interaction between different objects to answer questions based on the context provided within the video.
Previous approaches leverage either sub-sampled information or causal intervention techniques along with complete video features to tackle the NExT-QA task.
In this work we elicit the limitations of these approaches and propose solutions along four novel directions of improvements on the NExT-QA dataset.
arXiv Detail & Related papers (2024-07-23T06:32:46Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- SRQA: Synthetic Reader for Factoid Question Answering [21.28441702154528]
We introduce a new model called SRQA, which means Synthetic Reader for Factoid Question Answering.
This model enhances the question answering system in the multi-document scenario from three aspects.
We perform SRQA on the WebQA dataset, and experiments show that our model outperforms the state-of-the-art models.
arXiv Detail & Related papers (2020-09-02T13:16:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.