VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
- URL: http://arxiv.org/abs/2506.06097v1
- Date: Fri, 06 Jun 2025 13:58:31 GMT
- Title: VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning
- Authors: Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, Yali Wang,
- Abstract summary: VideoChat-A1 is a novel long video agent paradigm. It can think deeply with long videos via a distinct chain-of-shot reasoning paradigm. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process.
- Score: 40.071064407275564
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in video understanding have been driven by multimodal large language models (MLLMs). However, these MLLMs are good at analyzing short videos while struggling to understand videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents for retrieving extra contextual knowledge in a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots: to answer a user question about a long video, it is critical to deeply understand the relevant shots, as a human would. Without such insight, these agents often mistakenly retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Different from previous works, VideoChat-A1 can think deeply with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and examines these shots through a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 effectively mimics the step-by-step human thinking process, allowing it to interactively discover the preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks, e.g., 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., InternVL2.5-8B and InternVideo2.5-8B) by up to 10.8\% and 6.2\%, respectively. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy while using only 7\% of the input frames and 12\% of the inference time on average.
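To make the paradigm concrete, below is a minimal sketch of the chain-of-shot reasoning loop as the abstract describes it. This is an illustration only, not the authors' released code: the shot selector, shot splitter, and MLLM answerer are hypothetical callables supplied by the caller, and the toy usage at the bottom only exercises the control flow.

```python
# Illustrative sketch of chain-of-shot reasoning (a reading of the abstract,
# not the authors' implementation). The shot selector, splitter, and MLLM
# are passed in as callables so the sketch stays model-agnostic.
from typing import Callable, List, Sequence, Tuple

Shot = str  # placeholder: in practice a shot would be a clip of frames


def chain_of_shot_answer(
    shots: Sequence[Shot],
    question: str,
    select_relevant: Callable[[Sequence[Shot], str], List[Shot]],
    split_shot: Callable[[Shot], List[Shot]],
    answer_with_mllm: Callable[[str, List[Shot]], Tuple[str, float]],
    max_rounds: int = 3,
    confidence_threshold: float = 0.8,
) -> str:
    """Progressively select question-relevant shots, refine them
    coarse-to-fine, and reason over the accumulated shot chain."""
    shot_chain: List[Shot] = []   # shots inspected so far (the "chain")
    candidates = list(shots)      # coarse shot partition of the long video
    answer, confidence = "", 0.0

    for _ in range(max_rounds):
        # 1) Pick the shots most relevant to the question, skipping ones
        #    already on the chain to avoid redundant or noisy context.
        relevant = [s for s in select_relevant(candidates, question)
                    if s not in shot_chain]
        if not relevant:
            break

        # 2) Coarse-to-fine: split each selected shot into finer sub-shots.
        refined = [sub for shot in relevant for sub in split_shot(shot)]
        shot_chain.extend(refined)

        # 3) Multi-modal reasoning over the accumulated shot chain.
        answer, confidence = answer_with_mllm(question, shot_chain)

        # 4) Stop once the answer looks confident; otherwise keep refining
        #    within the newly selected shots, mimicking step-by-step viewing.
        if confidence >= confidence_threshold:
            break
        candidates = refined

    return answer


# Toy usage with dummy callables, just to show the control flow:
ans = chain_of_shot_answer(
    shots=["shot_0", "shot_1", "shot_2"],
    question="What does the chef add after the onions?",
    select_relevant=lambda shots, q: list(shots)[:2],
    split_shot=lambda s: [s + "_a", s + "_b"],
    answer_with_mllm=lambda q, chain: ("garlic", 0.9),
)
```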
Related papers
- HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [20.184894298462652]
We build a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z)
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
Long-context video modeling is critical for multimodal large language models (MLLMs). This paper aims to address this issue from the aspects of model architecture, training data, training strategy and evaluation benchmark. We build a powerful video MLLM named VideoChat-Flash, which shows leading performance on both mainstream long and short video benchmarks.
arXiv Detail & Related papers (2024-12-31T18:01:23Z)
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos [86.28679075537089]
LongVILA is a full-stack solution for long-context visual-language models. LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy on the 6,000-frame (more than 1 million tokens) video needle-in-a-haystack test.
arXiv Detail & Related papers (2024-08-19T17:48:08Z)
- VideoQA in the Era of LLMs: An Empirical Study [108.37456450182054]
Video Large Language Models (Video-LLMs) are flourishing and have advanced many video-intuitive tasks. This work conducts a timely and comprehensive study of Video-LLMs' behavior in VideoQA. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, the models falter in handling video temporality, both in reasoning about temporal content ordering and in grounding QA-relevant temporal moments.
arXiv Detail & Related papers (2024-08-08T05:14:07Z)
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content.
Our results indicate that our models achieve significant improvements in both long and short video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
- InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding [31.147208579511247]
InfiniBench is a benchmark for understanding long videos ranging from tens of minutes to several hours.
It presents: 1) the longest video duration, averaging 52.59 minutes per video; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions, which examine nine different skills and include both multiple-choice and open-ended questions; 4) a human-centric focus, as the video sources come from movies and daily TV shows.
Our findings reveal that even leading AI models like GPT-4o and Gemini 1.5 Flash face challenges in achieving high performance in long video understanding.
arXiv Detail & Related papers (2024-06-28T12:35:01Z)
- DrVideo: Document Retrieval Based Long Video Understanding [44.34473173458403]
DrVideo is a document-retrieval-based system designed for long video understanding.
It first transforms a long video into a coarse text-based long document, retrieves key frames, and then updates the document with augmented key-frame information.
It then employs an agent-based iterative loop to continuously search for missing information and augment the document until sufficient question-related information is gathered (see the sketch after this list).
arXiv Detail & Related papers (2024-06-18T17:59:03Z)
- MovieChat+: Question-aware Sparse Memory for Long Video Question Answering [36.14140811797466]
We propose MovieChat to overcome the challenges of understanding long videos.
We use tokens in Transformers as the carriers of memory in combination with our specially designed memory mechanism.
MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations to validate the effectiveness of our method.
arXiv Detail & Related papers (2024-04-26T06:17:04Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [28.316828641898375]
VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video.
2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task.
arXiv Detail & Related papers (2024-03-18T05:07:59Z)
- A Simple LLM Framework for Long-Range Video Question-Answering [63.50439701867275]
We present LLoVi, a language-based framework for long-range video question-answering (LVQA).
Our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4).
Our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain).
arXiv Detail & Related papers (2023-12-28T18:58:01Z)
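For contrast with the shot-based reasoning above, the DrVideo entry in this list describes a document-retrieval loop. The sketch below illustrates that pattern only; the captioner, retriever, and LLM are hypothetical callables, and this is not the DrVideo authors' implementation.

```python
# Minimal sketch of a document-retrieval loop for long-video QA, in the
# spirit of the DrVideo summary above (hypothetical callables, not the
# DrVideo authors' code).
from typing import Callable, Dict, List


def retrieval_based_video_qa(
    frame_captions: Dict[int, str],                         # frame index -> caption
    question: str,
    retrieve_key_frames: Callable[[str, str], List[int]],   # (document, question) -> frame ids
    augment_frame: Callable[[int], str],                     # frame id -> richer description
    answer_or_empty: Callable[[str, str], str],              # (document, question) -> "" if info missing
    max_iters: int = 4,
) -> str:
    # 1) Turn the long video into a coarse text-based document of captions.
    document = "\n".join(f"[frame {i}] {c}" for i, c in sorted(frame_captions.items()))

    answer = ""
    for _ in range(max_iters):
        # 2) Retrieve question-relevant key frames and augment the document
        #    with richer descriptions of those frames.
        for frame_id in retrieve_key_frames(document, question):
            document += f"\n[frame {frame_id}, detailed] {augment_frame(frame_id)}"

        # 3) Ask the LLM; an empty answer signals missing information, so the
        #    loop searches and augments again until it can answer.
        answer = answer_or_empty(document, question)
        if answer:
            break
    return answer
```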