A Simple LLM Framework for Long-Range Video Question-Answering
- URL: http://arxiv.org/abs/2312.17235v2
- Date: Mon, 26 Feb 2024 17:29:30 GMT
- Title: A Simple LLM Framework for Long-Range Video Question-Answering
- Authors: Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu,
Mohit Bansal, Gedas Bertasius
- Abstract summary: We present LLoVi, a language-based framework for long-range video question-answering (LVQA)
Our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4)
Our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain)
- Score: 66.68887077133355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present LLoVi, a language-based framework for long-range video
question-answering (LVQA). Unlike prior long-range video understanding methods,
which are often costly and require specialized long-range video modeling design
(e.g., memory queues, state-space layers, etc.), our approach uses a
frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a
Large Language Model (GPT-3.5, GPT-4) leading to a simple yet surprisingly
effective LVQA framework. Specifically, we decompose short and long-range
modeling aspects of LVQA into two stages. First, we use a short-term visual
captioner to generate textual descriptions of short video clips (0.5-8s in
length) densely sampled from a long input video. Afterward, an LLM aggregates
the densely extracted short-term captions to perform long-range temporal
reasoning needed to understand the whole video and answer a question. To
analyze what makes our simple framework so effective, we thoroughly evaluate
various components of our system. Our empirical analysis reveals that the
choice of the visual captioner and LLM is critical for good LVQA performance.
Furthermore, we show that a specialized prompt that asks the LLM first to
summarize the noisy short-term visual captions and then answer a given input
question leads to a significant LVQA performance boost. On EgoSchema, a
benchmark best known for very long-form video question-answering, our method
achieves 50.3% accuracy, outperforming the previous best-performing approach by
18.1% (absolute gain). In addition, our approach outperforms the previous
state-of-the-art by 4.1% and 3.1% on NExT-QA and IntentQA. We also extend LLoVi
to grounded LVQA and show that it outperforms all prior methods on the NExT-GQA
dataset. We will release our code at https://github.com/CeeZh/LLoVi.
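To make the two-stage recipe above concrete, here is a minimal Python sketch of the pipeline the abstract describes. This is not the authors' released implementation (see the GitHub link above): the `caption_clip` stub, the 1-second clip length, and the exact prompt wording are illustrative assumptions; only the OpenAI chat-completions call follows the real client API.

```python
# Minimal sketch of the LLoVi two-stage recipe (illustrative, not the official code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_clip(video_path: str, start_s: float, end_s: float) -> str:
    """Hypothetical stand-in for a short-term captioner such as LaViLa or BLIP-2."""
    raise NotImplementedError("plug in a real frame/clip-level visual captioner")


def llovi_answer(video_path: str, duration_s: float, question: str,
                 clip_len_s: float = 1.0, model: str = "gpt-4") -> str:
    # Stage 1: densely caption short clips sampled across the whole video.
    starts = [i * clip_len_s for i in range(int(duration_s // clip_len_s))]
    captions = [f"[{s:.1f}s] " + caption_clip(video_path, s, s + clip_len_s)
                for s in starts]

    # Stage 2: summarize-then-answer prompt; the abstract reports that asking the
    # LLM to first summarize the noisy captions gives a significant accuracy boost.
    prompt = ("Below are noisy captions of short clips from one long video.\n"
              + "\n".join(captions)
              + "\n\nFirst summarize what happens in the video, then answer: "
              + question)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```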
Related papers
- Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions about both visual and textual content.
Our results indicate that our models achieve significant improvements in both long- and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z)
- Hallucination Mitigation Prompts Long-term Video Understanding [36.26790392889717]
This paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs.
We use the CLIP score, conditioned on the question, to guide frame sampling and select key frames relevant to the question (sketched in code after this list).
During the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers.
arXiv Detail & Related papers (2024-06-17T08:44:03Z)
- Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA [40.54207548074378]
Long-form videos that span wide temporal intervals are highly redundant in information.
When performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames.
Recent literature explores the use of large language models (LLMs) in LVQA benchmarks, achieving exceptional performance.
arXiv Detail & Related papers (2024-06-13T17:59:16Z)
- VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos [67.78336281317347]
VideoTree is a query-adaptive and hierarchical framework for long-video understanding with Large Language Models.
VideoTree adaptively selects frames for captioning by iteratively clustering frames based on their visual features.
It organizes the visual clusters into a query-adaptive and hierarchical tree structure (a single-level clustering sketch appears after this list).
arXiv Detail & Related papers (2024-05-29T15:49:09Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (the same idea sketched in code after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Natural Language Video Localization: A Revisit in Span-based Question Answering Framework [56.649826885121264]
Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query.
Existing approaches mainly solve the NLVL problem from the perspective of computer vision.
We address NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage (a toy sketch of this framing appears after this list).
arXiv Detail & Related papers (2021-02-26T15:57:59Z)
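Two entries above (Hallucination Mitigation Prompts Long-term Video Understanding, VaQuitA) select frames by CLIP score against the question. Below is a minimal sketch of that idea, assuming frames are already decoded to PIL images; the checkpoint name is illustrative and this is not either paper's code.

```python
# CLIP-score-guided key-frame selection (illustrative sketch).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def select_key_frames(frames: list[Image.Image], question: str, k: int = 8):
    """Rank frames by CLIP image-text similarity to the question; keep the top-k."""
    inputs = processor(text=[question], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per frame
    top = torch.topk(scores, k=min(k, len(frames))).indices
    return [frames[i] for i in sorted(top.tolist())]  # preserve temporal order
```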
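The VideoTree entry selects frames by iteratively clustering visual features. Below is a rough single-level sketch of that clustering step; the cluster count and the nearest-to-centroid representative rule are illustrative assumptions, and the paper's query-adaptive hierarchy would recurse into clusters needing more detail.

```python
# Single level of VideoTree-style frame clustering (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans


def representative_frames(features: np.ndarray, n_clusters: int = 8) -> list:
    """Cluster per-frame features; return the frame index nearest each centroid."""
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(features)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)  # caption these frames, in temporal order
```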
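Finally, the Natural Language Video Localization entry recasts moment localization as span-based QA, with the video playing the role of a text passage. A toy sketch of that framing follows; the cross-attention fusion module and feature dimensions are illustrative assumptions, not the paper's architecture.

```python
# Span-based QA view of NLVL (toy sketch): predict start/end clip indices,
# exactly as extractive reading comprehension predicts start/end tokens.
import torch
import torch.nn as nn


class SpanLocalizer(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.start_head = nn.Linear(dim, 1)
        self.end_head = nn.Linear(dim, 1)

    def forward(self, clip_feats: torch.Tensor, query_feats: torch.Tensor):
        # Each video clip ("passage token") attends to the text query.
        fused, _ = self.fuse(clip_feats, query_feats, query_feats)
        start_logits = self.start_head(fused).squeeze(-1)  # (batch, num_clips)
        end_logits = self.end_head(fused).squeeze(-1)
        return start_logits, end_logits
```

The predicted moment is the (start, end) clip pair with the highest joint score, subject to start <= end, mirroring answer-span extraction in text QA.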