MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
- URL: http://arxiv.org/abs/2506.18071v2
- Date: Fri, 27 Jun 2025 06:32:43 GMT
- Title: MUPA: Towards Multi-Path Agentic Reasoning for Grounded Video Question Answering
- Authors: Jisheng Dang, Huilin Song, Junbin Xiao, Bimei Wang, Han Peng, Haoxuan Li, Xun Yang, Meng Wang, Tat-Seng Chua,
- Abstract summary: Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation.
- Score: 64.46361702927457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grounded Video Question Answering (Grounded VideoQA) requires aligning textual answers with explicit visual evidence. However, modern multimodal models often rely on linguistic priors and spurious correlations, resulting in poorly grounded predictions. In this work, we propose MUPA, a cooperative MUlti-Path Agentic approach that unifies video grounding, question answering, answer reflection and aggregation to tackle Grounded VideoQA. MUPA features three distinct reasoning paths that interleave grounding and QA agents in different chronological orders, along with a dedicated reflection agent to judge and aggregate the multi-path results to accomplish consistent QA and grounding. This design markedly improves grounding fidelity without sacrificing answer accuracy. Despite using only 2B parameters, our method outperforms all 7B-scale competitors. When scaled to 7B parameters, MUPA establishes new state-of-the-art results, with Acc@GQA of 30.3% and 47.4% on NExT-GQA and DeVE-QA respectively, demonstrating MUPA's effectiveness towards trustworthy video-language understanding. Our code is available at https://github.com/longmalongma/MUPA.
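The abstract describes the overall control flow (three reasoning paths that order grounding and QA differently, followed by a reflection agent that judges and aggregates them). A minimal sketch of that flow is given below; all agent functions are stand-in stubs invented for illustration, not the paper's models, and the aggregation rule (majority vote over answers, averaged segments) is an assumption rather than MUPA's actual reflection mechanism.

```python
# Hypothetical sketch of a multi-path grounded-QA pipeline in the spirit of
# the abstract: three paths produce (answer, temporal segment) candidates,
# and a reflection step aggregates them. All functions are illustrative stubs.
from collections import Counter


def ground_then_answer(video, question):
    # Path 1: localize evidence first, then answer from the segment.
    segment = (2.0, 5.0)  # stub grounding: (start, end) in seconds
    return "a dog", segment


def answer_then_ground(video, question):
    # Path 2: answer first, then localize supporting evidence.
    return "a dog", (2.5, 5.5)


def joint_grounded_qa(video, question):
    # Path 3: produce the answer and its grounding jointly.
    return "a cat", (10.0, 12.0)


def reflect(candidates):
    # Stub reflection agent: keep the majority answer, then average the
    # temporal segments of the paths that agree with it.
    majority, _ = Counter(ans for ans, _ in candidates).most_common(1)[0]
    segments = [seg for ans, seg in candidates if ans == majority]
    start = sum(s[0] for s in segments) / len(segments)
    end = sum(s[1] for s in segments) / len(segments)
    return majority, (start, end)


def multi_path_pipeline(video, question):
    paths = [ground_then_answer, answer_then_ground, joint_grounded_qa]
    candidates = [path(video, question) for path in paths]
    return reflect(candidates)


answer, segment = multi_path_pipeline("video.mp4", "What appears at the door?")
print(answer, segment)  # -> a dog (2.25, 5.25)
```

With these stubs, two of the three paths agree on the answer, so the reflection step returns it together with the mean of their segments; the disagreeing path is discarded, which is one simple way a reflection agent can suppress a poorly grounded prediction.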
Related papers
- TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision [47.12557166147296]
We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup. We generate an open-ended answer grounded with the start and end time. We instruct-tune TOGA to jointly generate the answer and the temporal grounding.
arXiv Detail & Related papers (2025-06-11T06:52:31Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation [49.1574468325115]
This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA). It combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks.
arXiv Detail & Related papers (2025-05-21T18:32:43Z) - Cross-modal Causal Relation Alignment for Video Question Grounding [44.97933293141372]
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. Existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. We propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding.
arXiv Detail & Related papers (2025-03-05T01:36:32Z) - Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level [63.18855743293851]
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question. This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions. We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA).
arXiv Detail & Related papers (2024-11-15T03:45:09Z) - Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z) - Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches on superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.