Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2511.02182v1
- Date: Tue, 04 Nov 2025 01:50:19 GMT
- Title: Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models
- Authors: Jinhwan Seo, Yoonki Cho, Junhyug Noh, Sung-eui Yoon,
- Abstract summary: We introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. We achieve a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
- Score: 18.905799883895757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
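As a rough illustration of the pipeline structure only: the sketch below wires the three stages together around a trigger-moment anchor. Every function name, the per-frame visibility scores, and the dummy detector/tracker are hypothetical placeholders invented for this summary, assuming the high-level decomposition stated in the abstract; the actual CORTEX prompt, models, and tracker are only described in the paper itself.

```python
"""Minimal sketch of the three-stage GVQA pipeline described in the abstract.

Everything below is a hypothetical stand-in, not the authors' code: the CORTEX
prompt, the multimodal LLM, the detector, and the tracker are replaced by dummy
stubs so the overall control flow can be shown end to end.
"""

from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Grounding:
    frame_idx: int  # trigger moment: the single most visible frame of the target
    box: Box        # target box at the trigger frame


# Stage 1: Video Reasoning & QA (stand-in for a multimodal LLM call).
def answer_question(frames: List[str], question: str) -> Tuple[str, str]:
    return "the cup on the left", "cup"  # (answer text, referenced target object)


# Stage 2: Spatio-temporal Grounding.
def find_trigger_moment(frames: List[str], target: str) -> int:
    # Stand-in for the CORTEX prompt: select the frame where the target is most
    # visible. Per-frame visibility scores are faked here for illustration.
    visibility = [(i * 37 % 100) / 100.0 for i in range(len(frames))]
    return max(range(len(frames)), key=visibility.__getitem__)


def ground_object(frame: str, target: str) -> Box:
    return (10.0, 20.0, 110.0, 220.0)  # stand-in open-vocabulary detector


# Stage 3: Tracking.
def track_object(frames: List[str], anchor: Grounding) -> List[Box]:
    # Stand-in tracker: propagate the anchor box to every frame unchanged.
    return [anchor.box for _ in frames]


def run_gvqa_pipeline(frames: List[str], question: str) -> Tuple[str, List[Box]]:
    answer, target = answer_question(frames, question)                    # Stage 1
    t = find_trigger_moment(frames, target)                               # Stage 2a
    anchor = Grounding(frame_idx=t, box=ground_object(frames[t], target)) # Stage 2b
    tube = track_object(frames, anchor)                                   # Stage 3
    return answer, tube


if __name__ == "__main__":
    print(run_gvqa_pipeline(["f0", "f1", "f2"], "Which object did the person pick up?"))
```

The design point the abstract emphasises is that grounding and tracking both hang off a single anchor frame (the trigger moment), rather than grounding every frame independently.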
Related papers
- Perception Test 2025: Challenge Summary and a Unified VQA Extension [56.23039846339896]
The Third Perception Test challenge was organised as a full-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Its primary goal is to benchmark state-of-the-art video models and measure progress in multimodal perception. We summarise the results from the main Perception Test challenge, detailing both the existing tasks and the novel additions to the benchmark.
arXiv Detail & Related papers (2026-01-09T20:02:21Z) - STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes [5.685235562999083]
STRIDE-QA is the largest visual question answering dataset for spatiotemporal reasoning in urban driving. It supports both object-centric and ego-centric reasoning through spatial localization and temporal prediction. Our benchmarks demonstrate that existing Vision-Language Models (VLMs) struggle, achieving near-zero scores on prediction consistency.
arXiv Detail & Related papers (2025-08-14T07:57:06Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores for the QA task on our proposed AVHaystacks benchmark.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes [26.948071735495237]
We present TUMTraffic-VideoQA, a dataset and benchmark designed for understanding complex traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice pairs, 2,300 object captions, and 5,700 object annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies.
arXiv Detail & Related papers (2025-02-04T16:14:40Z) - Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark [64.16672247204997]
We organised the Second Perception Test challenge as a half-day workshop alongside the IEEE/CVF European Conference on Computer Vision (ECCV) 2024. The goal was to benchmark state-of-the-art video models and measure the progress since last year using the Perception Test benchmark. This year, the challenge had seven tracks and covered low-level and high-level tasks, with language and non-language interfaces, across video, audio, and text modalities. The additional track covered hour-long video understanding and introduced a novel video QA benchmark, 1h-walk VQA.
arXiv Detail & Related papers (2024-11-29T18:57:25Z) - Perception Test 2023: A Summary of the First Challenge And Outcome [67.0525378209708]
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
arXiv Detail & Related papers (2023-12-20T15:12:27Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
arXiv Detail & Related papers (2021-05-11T07:45:32Z) - Video Moment Retrieval via Natural Language Queries [7.611718124254329]
We propose a novel method for video moment retrieval (VMR) that achieves state-of-the-art (SOTA) performance on R@1 metrics.
Our model has a simple architecture, which enables faster training and inference while maintaining performance.
arXiv Detail & Related papers (2020-09-04T22:06:34Z)