DIVE: Deep-search Iterative Video Exploration - A Technical Report for the CVRR Challenge at CVPR 2025
- URL: http://arxiv.org/abs/2506.21891v1
- Date: Fri, 27 Jun 2025 04:05:12 GMT
- Title: DIVE: Deep-search Iterative Video Exploration - A Technical Report for the CVRR Challenge at CVPR 2025
- Authors: Umihiro Kamoto, Tatsuya Ishibashi, Noriyuki Kugo, et al.
- Abstract summary: In this report, we present the winning solution that achieved first place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025. The challenge uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE, adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this report, we present the winning solution that achieved first place in the Complex Video Reasoning & Robustness Evaluation Challenge 2025. This challenge evaluates the ability to generate accurate natural language answers to questions about diverse, real-world video clips. It uses the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) benchmark, which consists of 214 unique videos and 2,400 question-answer pairs spanning 11 categories. Our method, DIVE (Deep-search Iterative Video Exploration), adopts an iterative reasoning approach, in which each input question is semantically decomposed and solved through stepwise reasoning and progressive inference. This enables our system to provide highly accurate and contextually appropriate answers to even the most complex queries. Applied to the CVRR-ES benchmark, our approach achieves 81.44% accuracy on the test set, securing the top position among all participants. This report details our methodology and provides a comprehensive analysis of the experimental results, demonstrating the effectiveness of our iterative reasoning framework in achieving robust video question answering. The code is available at https://github.com/PanasonicConnect/DIVE
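To make the described loop concrete, the sketch below shows one plausible shape for such an iterative decompose-and-answer pipeline. It is an illustration only, not the authors' released code (which is at the GitHub link above): the `llm` and `vlm_answer` callables, the prompts, and the convergence test are all assumptions made for exposition.

```python
# Hypothetical sketch of DIVE-style iterative video reasoning.
# `llm` is a text LLM callable (str -> str); `vlm_answer` is a
# video-language model callable (video_path, question -> str).


def decompose(llm, question: str, evidence: list[str]) -> list[str]:
    """Semantically decompose the question into simpler sub-questions,
    conditioned on evidence gathered in earlier rounds."""
    prompt = (
        "Decompose the question into atomic sub-questions, one per line.\n"
        f"Question: {question}\n"
        "Known so far:\n" + "\n".join(evidence)
    )
    return [q for q in llm(prompt).splitlines() if q.strip()]


def dive_answer(llm, vlm_answer, video_path: str, question: str,
                max_rounds: int = 3) -> str:
    """Stepwise reasoning with progressive inference: keep exploring
    the video until the answer stabilizes or the round budget runs out."""
    evidence: list[str] = []
    answer = ""
    for _ in range(max_rounds):
        for sub_q in decompose(llm, question, evidence):
            # Query the video-language model for each sub-question and
            # record the result as evidence for the next round.
            evidence.append(f"{sub_q} -> {vlm_answer(video_path, sub_q)}")
        new_answer = llm(
            "Answer the original question using the evidence.\n"
            f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
        )
        if new_answer == answer:  # answer stopped changing: stop early
            break
        answer = new_answer
    return answer
```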
Related papers
- Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought [19.792159494513424]
Video comprehension is essential for various applications ranging from video analysis to interactive systems. Despite advancements in vision-language models, these models often struggle to capture nuanced, spatiotemporal details. To address this, we introduce Video-CoT, a groundbreaking dataset designed to enhance video understanding.
arXiv Detail & Related papers (2025-06-10T14:08:56Z) - MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
arXiv Detail & Related papers (2025-06-08T06:34:29Z) - PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild [164.8093566483583]
This report provides a comprehensive overview of the 4th Pixel-level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion-guided, language-based video segmentation.
arXiv Detail & Related papers (2025-04-15T16:02:47Z) - AIM 2024 Challenge on Video Super-Resolution Quality Assessment: Methods and Results [76.64868221556145]
This paper presents the Video Super-Resolution (SR) Quality Assessment (QA) Challenge that was part of the Advances in Image Manipulation (AIM) workshop.
The task of this challenge was to develop an objective QA method for videos upscaled 2x and 4x by modern image- and video-SR algorithms.
The goal was to advance the state-of-the-art in SR QA, which had proven to be a challenging problem with limited applicability of traditional QA methods.
arXiv Detail & Related papers (2024-10-05T16:42:23Z) - AIM 2024 Challenge on Video Saliency Prediction: Methods and Results [105.09572982350532]
This paper reviews the Challenge on Video Saliency Prediction at AIM 2024.
The goal of the participants was to develop a method for predicting accurate saliency maps for the provided set of video sequences.
arXiv Detail & Related papers (2024-09-23T08:59:22Z) - NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results [126.78130602974319]
This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4).
The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs.
The aim of the challenge is to obtain designs/solutions with the most advanced SR performance.
arXiv Detail & Related papers (2024-04-15T13:45:48Z) - Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model, reducing error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z) - ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022 [61.81899056005645]
Given a video clip and a text query, the goal of this challenge is to locate a temporal moment of the video clip where the answer to the query can be obtained.
We propose a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips.
The experimental results demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2022-07-01T12:48:35Z)