VITED: Video Temporal Evidence Distillation
- URL: http://arxiv.org/abs/2503.12855v1
- Date: Mon, 17 Mar 2025 06:30:02 GMT
- Title: VITED: Video Temporal Evidence Distillation
- Authors: Yujie Lu, Yale Song, William Wang, Lorenzo Torresani, Tushar Nagarajan
- Abstract summary: We investigate complex video question answering via chain-of-evidence reasoning. Models struggle with multi-step reasoning as they uniformly sample a fixed number of frames. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate complex video question answering via chain-of-evidence reasoning -- identifying sequences of temporal spans from multiple relevant parts of the video, together with the visual evidence within them. Existing models struggle with multi-step reasoning because they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video, with supporting evidence, that maximize the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it both to localize evidence windows and to perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long-video QA benchmarks, where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.
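The abstract's key mechanism is a search over temporal intervals whose content maximizes the likelihood of answering the question, with the selected intervals chained as evidence. Below is a minimal illustrative sketch of that idea in Python; the `answer_likelihood` scorer and the greedy sliding-window search are assumptions for illustration, not the paper's actual construction procedure.

```python
def distill_evidence_chain(video_frames, question, answer,
                           answer_likelihood, window=16, stride=8, top_k=3):
    """Sketch of evidence-chain construction: score sliding temporal
    windows by how well their frames support the correct answer, then
    keep the top-k non-overlapping intervals in chronological order.
    `answer_likelihood` is a hypothetical scorer, e.g. a VideoQA
    model's probability of the correct answer given only those frames.
    """
    candidates = []
    for start in range(0, max(1, len(video_frames) - window + 1), stride):
        end = min(start + window, len(video_frames))
        score = answer_likelihood(video_frames[start:end], question, answer)
        candidates.append((score, (start, end)))

    # Greedily keep high-scoring, mutually non-overlapping windows.
    chain = []
    for score, (s, e) in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(e <= cs or s >= ce for cs, ce in chain):
            chain.append((s, e))
        if len(chain) == top_k:
            break
    return sorted(chain)  # chronological evidence chain
```

Per the abstract, VITED is then trained to generate such evidence chains directly, so at inference time the model localizes the windows itself rather than running an explicit search.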
Related papers
- VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning [33.37714717781103]
VideoMind is a novel video-language agent designed for temporal-grounded video understanding. We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow. We propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors.
arXiv Detail & Related papers (2025-03-17T17:59:33Z)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [11.244643114253773]
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos.
We propose a novel weakly supervised framework that compels LMMs to reason out the answers using question-critical moments as visual inputs.
arXiv Detail & Related papers (2024-01-19T14:21:46Z)
- MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding [69.04413943858584]
We introduce MoVQA, a long-form movie question-answering dataset.
We also provide a benchmark to assess the diverse cognitive capabilities of multimodal systems.
arXiv Detail & Related papers (2023-12-08T03:33:38Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- Transforming Multi-Concept Attention into Video Summarization [36.85535624026879]
We propose a novel attention-based framework for video summarization with complex video data.
Our model can be applied to both labeled and unlabeled data, making our method preferable for real-world applications.
arXiv Detail & Related papers (2020-06-02T06:23:50Z)
- Multi-Modal Video Forensic Platform for Investigating Post-Terrorist Attack Scenarios [55.82693757287532]
Large-scale Video Analytic Platforms (VAPs) assist law enforcement agencies (LEAs) in identifying suspects and securing evidence.
We present a video analytic platform that integrates visual and audio analytic modules and fuses information from surveillance cameras and video uploads from eyewitnesses.
arXiv Detail & Related papers (2020-04-02T14:29:27Z)