VideoRAG: Retrieval-Augmented Generation over Video Corpus
- URL: http://arxiv.org/abs/2501.05874v1
- Date: Fri, 10 Jan 2025 11:17:15 GMT
- Title: VideoRAG: Retrieval-Augmented Generation over Video Corpus
- Authors: Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
- Abstract summary: VideoRAG is a novel framework that dynamically retrieves videos based on their relevance to queries.
We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
- Abstract: Retrieval-Augmented Generation (RAG) is a powerful strategy to address the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into their generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both the visual and textual information of videos in the output generation. Further, to operationalize this, our method builds on the recent advance of Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines.
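For concreteness, a minimal sketch of this retrieve-then-generate loop, where `lvlm_encode` and `lvlm_generate` are hypothetical stand-ins for real LVLM calls, not the paper's actual API:

```python
# Toy VideoRAG loop: embed query and videos, retrieve top-k by cosine
# similarity, then generate jointly over the query and the retrieved videos.
import numpy as np

def lvlm_encode(item: str) -> np.ndarray:
    """Stand-in for an LVLM embedding call; a toy hash-seeded unit vector."""
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = lvlm_encode(query)
    scores = np.array([q @ lvlm_encode(v) for v in corpus])  # cosine on unit vectors
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def lvlm_generate(query: str, videos: list[str]) -> str:
    """Stand-in for the LVLM generating jointly over query and video content."""
    return f"Answer to {query!r}, grounded in {videos}"

corpus = ["bread_baking_basics.mp4", "city_tour.mp4", "origami_crane.mp4"]
print(lvlm_generate("How do I bake bread?", retrieve("How do I bake bread?", corpus)))
```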
Related papers
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos [25.770675590118547]
VideoRAG is the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos.
Its core innovation is a dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features.
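As a rough illustration only, a toy fusion of the two channels; the linear weighting and the scores themselves are assumptions, not the paper's formulation:

```python
# Each candidate gets a textual-knowledge score and a visual score; fuse, then rank.
def fuse_channels(text_score: float, visual_score: float, alpha: float = 0.5) -> float:
    return alpha * text_score + (1 - alpha) * visual_score

# (text_score, visual_score) per candidate video, illustrative values only
candidates = {"vid_a": (0.8, 0.3), "vid_b": (0.4, 0.9)}
ranked = sorted(candidates, key=lambda v: fuse_channels(*candidates[v]), reverse=True)
print(ranked)  # ['vid_b', 'vid_a']: the visual channel lifts vid_b above vid_a
```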
arXiv Detail & Related papers (2025-02-03T17:30:19Z)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
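A minimal sketch of segment-level retrieval in this spirit, with naive token-overlap scoring standing in for the paper's retriever and routing:

```python
# Score each densely captioned segment against the query; route only the best
# segment(s) into the video-LLM's context.
def segment_scores(query: str, captions: dict[str, str]) -> dict[str, int]:
    q = set(query.lower().split())
    return {seg: len(q & set(cap.lower().split())) for seg, cap in captions.items()}

captions = {
    "00:00-01:00": "a chef kneads dough on a floured table",
    "01:00-02:00": "the dough rises in a warm oven",
    "02:00-03:00": "credits roll over music",
}
scores = segment_scores("how long does the dough rise", captions)
best = max(scores, key=scores.get)
print(best, scores[best])  # the segment routed to the video-LLM
```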
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
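A hedged sketch of the segment-and-caption step, with a stub in place of a real zero-shot captioner:

```python
# Segment a video into short clips, caption each clip, and join the captions
# into an enriched scene description the text-video retriever can train on.
def segment(duration_s: int, clip_len_s: int = 10) -> list[tuple[int, int]]:
    return [(t, min(t + clip_len_s, duration_s)) for t in range(0, duration_s, clip_len_s)]

def caption_clip(video_id: str, span: tuple[int, int]) -> str:
    return f"[{video_id} {span[0]}-{span[1]}s] a captioned scene"  # stub for zero-shot captioner

def enrich(video_id: str, duration_s: int) -> str:
    return " ".join(caption_clip(video_id, s) for s in segment(duration_s))

print(enrich("v42", 25))  # three clip-level captions joined into one description
```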
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Towards Retrieval Augmented Generation over Large Video Libraries [0.0]
We introduce the task of Video Library Question Answering (VLQA) and address it through an interoperable architecture.
We propose a system that uses large language models (LLMs) to generate search queries, retrieving relevant video moments.
An answer generation module then integrates user queries with this metadata to produce responses with specific video timestamps.
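A minimal sketch of such a pipeline; a keyword heuristic stands in for the LLM query generator, and the metadata is illustrative:

```python
# (1) derive search queries from the question, (2) match them against
# per-moment metadata, (3) compose an answer citing matching timestamps.
STOPWORDS = {"the", "a", "an", "in", "of", "how", "do", "i", "is", "what"}

def make_queries(question: str) -> list[str]:
    # the paper uses an LLM to produce search queries; a keyword filter stands in
    return [w.strip("?.,") for w in question.lower().split() if w not in STOPWORDS]

def retrieve_moments(queries: list[str], metadata: list[tuple[str, str]]) -> list[tuple[str, str]]:
    return [(ts, text) for ts, text in metadata if any(q in text.lower() for q in queries)]

metadata = [("00:42", "presenter proofs the dough"), ("05:10", "oven preheated to 230C")]
hits = retrieve_moments(make_queries("How do I proof the dough?"), metadata)
print("Answer draws on: " + "; ".join(f"{text} at {ts}" for ts, text in hits))
```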
arXiv Detail & Related papers (2024-06-21T07:52:01Z)
- iRAG: Advancing RAG for Videos with an Incremental Approach [3.486835161875852]
One-time, upfront conversion of all content in a large corpus of videos into text descriptions entails high processing times.
We propose an incremental RAG system called iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of video data.
iRAG is the first system to support efficient interactive querying of a large corpus of videos in this way.
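A small sketch of the incremental workflow, assuming a cheap upfront index and a stubbed `expensive_extract` in place of captioning/ASR:

```python
# Keep only a cheap index upfront; run expensive extraction lazily, on first
# touch, and cache the result for later queries.
cheap_index = {"v1": "cooking", "v2": "travel", "v3": "cooking"}
cache: dict[str, str] = {}

def expensive_extract(video_id: str) -> str:
    print(f"(running captioning/ASR for {video_id} ...)")  # happens at most once per video
    return f"detailed text description of {video_id}"

def fetch_text(video_id: str) -> str:
    if video_id not in cache:  # incremental: extract only when a query touches the video
        cache[video_id] = expensive_extract(video_id)
    return cache[video_id]

def answer(topic: str) -> list[str]:
    return [fetch_text(v) for v, tag in cheap_index.items() if tag == topic]

answer("cooking")  # extracts v1 and v3
answer("cooking")  # second query served entirely from cache
```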
arXiv Detail & Related papers (2024-04-18T16:38:02Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
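A very rough sketch of modality-specific gating as described; the random weights stand in for learned parameters:

```python
# Each modality's features pass through a sigmoid gate before fusion, so the
# localizer can down-weight whichever modality is less informative.
import numpy as np

rng = np.random.default_rng(0)
W_vis, W_sub = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(visual: np.ndarray, subtitle: np.ndarray) -> np.ndarray:
    g_v = sigmoid(W_vis @ visual)    # per-dimension gate for the visual stream
    g_s = sigmoid(W_sub @ subtitle)  # per-dimension gate for the subtitle stream
    return g_v * visual + g_s * subtitle

fused = gated_fuse(rng.standard_normal(8), rng.standard_normal(8))
print(fused.shape)  # (8,)
```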
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- Zero-shot Audio Topic Reranking using Large Language Models [42.774019015099704]
Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval.
This work examines reranking approaches to compensate for any performance loss from the rapid first-pass archive search.
Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus.
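A minimal sketch of zero-shot reranking, with token overlap standing in for an LLM relevance judgment:

```python
# Fast first-pass search returns candidates; an LLM (stubbed here) scores each
# candidate's relevance to the query, and the final ranking sorts by that score.
def llm_relevance(query: str, document: str) -> float:
    """Stand-in for an LLM judgment; naive token overlap in [0, 1]."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(query: str, candidates: list[str]) -> list[str]:
    return sorted(candidates, key=lambda d: llm_relevance(query, d), reverse=True)

first_pass = ["weather report for the north", "interview about rail strikes", "rail timetable changes"]
print(rerank("rail strikes", first_pass))
```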
arXiv Detail & Related papers (2023-09-14T11:13:36Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)