Related papers: Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

URL: http://arxiv.org/abs/2411.13093v1
Date: Wed, 20 Nov 2024 07:44:34 GMT
Title: Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Authors: Yongdong Luo, Xiawu Zheng, Xiao Yang, Guilin Li, Haojia Lin, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, Rongrong Ji,
Abstract summary: Video-RAG is a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment. Our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.
Score: 83.00346826110041
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing large video-language models (LVLMs) struggle to comprehend long videos correctly due to limited context. To address this problem, fine-tuning long-context LVLMs and employing GPT-based agents have emerged as promising solutions. However, fine-tuning LVLMs would require extensive high-quality data and substantial GPU resources, while GPT-based agents would rely on proprietary models (e.g., GPT-4o). In this paper, we propose Video Retrieval-Augmented Generation (Video-RAG), a training-free and cost-effective pipeline that employs visually-aligned auxiliary texts to help facilitate cross-modality alignment while providing additional information beyond the visual content. Specifically, we leverage open-source external tools to extract visually-aligned information from pure video data (e.g., audio, optical character, and object detection), and incorporate the extracted information into an existing LVLM as auxiliary texts, alongside video frames and queries, in a plug-and-play manner. Our Video-RAG offers several key advantages: (i) lightweight with low computing overhead due to single-turn retrieval; (ii) easy implementation and compatibility with any LVLM; and (iii) significant, consistent performance gains across long video understanding benchmarks, including Video-MME, MLVU, and LongVideoBench. Notably, our model demonstrates superior performance over proprietary models like Gemini-1.5-Pro and GPT-4o when utilized with a 72B model.

Related papers

Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval [57.88666884515147]
We propose One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG)<n>OneClip-RAG makes full use of the merits of video clips for augmented video understanding.<n>It is also equipped with a novel query-guided video chunking algorithm.
arXiv Detail & Related papers (2025-12-09T09:40:20Z)
VideoRAG: Retrieval-Augmented Generation over Video Corpus [57.68536380621672]
VideoRAG is a framework that dynamically retrieves videos based on their relevance with queries. VideoRAG is powered by recent Large Video Language Models (LVLMs) We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
arXiv Detail & Related papers (2025-01-10T11:17:15Z)
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs)<n>$textbfSCBench$ is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method.<n>Our results found InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z)
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction [10.579335027350263]
AdaCM$2$ is an adaptive cross-modality memory reduction approach to video-text alignment on video streams. It achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
arXiv Detail & Related papers (2024-11-19T18:04:13Z)
Video Instruction Tuning With Synthetic Data [84.64519990333406]
We create a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.
arXiv Detail & Related papers (2024-10-03T17:36:49Z)
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner [53.671484175063995]
Video-LLMs are pre-trained to process short videos, limiting their broader application for understanding longer video content. We introduce an alternative video token rearrangement technique that circumvents limitations imposed by the fixed video encoder and alignment projector.
arXiv Detail & Related papers (2024-09-19T17:59:55Z)
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision [24.568643475808564]
Video Self-Training with augmented Reasoning (Video-STaR) is the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning.
arXiv Detail & Related papers (2024-07-08T17:59:42Z)
LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs [22.696090318037925]
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence. We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z)
Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding. The emergent capabilities Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning. This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models [52.83065081926238]
We propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework builds on SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks.
arXiv Detail & Related papers (2023-11-22T14:48:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.