Related papers: TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

URL: http://arxiv.org/abs/2404.01476v1
Date: Mon, 1 Apr 2024 20:58:24 GMT
Title: TraveLER: A Multi-LMM Agent Framework for Video Question-Answering
Authors: Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig,
Abstract summary: TraveLER is a framework that iteratively collects relevant information from design details through interactive question-asking. We find that the proposed TraveLER approach improves performance on several video question-answering benchmarks.
Score: 48.55956886819481
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach by leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps are identified. Moreover, they are unable to extract details relevant to the question, instead providing general descriptions of the frame. To overcome this, we design a multi-LMM agent framework that travels along the video, iteratively collecting relevant information from keyframes through interactive question-asking until there is sufficient information to answer the question. Specifically, we propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question. Finally, if there is not enough information, our method is able to "Replan" based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several video question-answering benchmarks, such as NExT-QA, STAR, and Perception Test, without the need to fine-tune on specific datasets.

Related papers

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer.<n> AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding task.<n>We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystack
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning [37.86612817818566]
We propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames.<n>Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames.<n>This, in turn, leads to improved performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2025-05-31T00:08:21Z)
ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation [49.1574468325115]
This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA)<n>It combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment.<n>This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks.
arXiv Detail & Related papers (2025-05-21T18:32:43Z)
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering [11.514596823413736]
Video Question Answering (VQA) inherently relies on multimodal reasoning. We introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. Our approach is also supplemented with a question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions.
arXiv Detail & Related papers (2025-04-25T22:08:09Z)
VidCtx: Context-aware Video Question Answering with Image Models [15.1350316858766]
We introduce VidCtx, a novel training-free VideoQA framework which integrates both visual information from input frames and textual descriptions of others frames. Experiments show that VidCtx achieves competitive performance among approaches that rely on open models.
arXiv Detail & Related papers (2024-12-23T09:26:38Z)
DLM-VMTL:A Double Layer Mapper for heterogeneous data video Multi-task prompt learning [2.4121373594852846]
Multi-Task Learning makes the visual task acquire the rich shareable knowledge from other tasks while joint training. Heterogenous data video multi-task prompt learning (VMTL) method is proposed to address above problem. Double-Layers Mapper(DLM) is proposed to extract the shareable knowledge into visual promptS and align it with representation of primary task.
arXiv Detail & Related papers (2024-08-29T01:25:36Z)
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding. We propose to process videos in an online manner and store past video information in a memory bank. Our model can achieve state-of-the-art performances across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks [31.031053149807857]
We introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools. For each of these task queries, we provide automatically generated plans using this realistic toolset. We provide a high-quality subset of 1,565 task plans that are human-verified and correctly.
arXiv Detail & Related papers (2024-03-17T04:36:18Z)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench. We first introduce a novel static-to-dynamic method to define these temporal-related tasks. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules. Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR) With a multiscale sampling, RMI iterates the interaction of appearance-motion information at each scale and the question embeddings to build the multilevel question-guided visual representations. PVR infers the visual cues at each level in parallel to fit with answering different question types that may rely on the visual information at relevant levels.
arXiv Detail & Related papers (2022-05-09T06:28:56Z)
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. We evaluate various baseline methods with and without large-scale VidL pre-training. The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. Our model is also comprised of dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and which pass more relevant information to gates. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.