SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding
- URL: http://arxiv.org/abs/2506.07600v1
- Date: Mon, 09 Jun 2025 10:00:54 GMT
- Title: SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding
- Authors: Nianbo Zeng, Haowen Hou, Fei Richard Yu, Si Shi, Ying Tiffany He
- Abstract summary: We present SceneRAG, a framework to segment videos into narrative-consistent scenes. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines.
- Score: 6.980340270823506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.
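To make the pipeline concrete, here is a minimal Python sketch of the segmentation-and-correction step as the abstract describes it. This is our illustration, not the authors' code: the LLM interface, the `Utterance` record, and the gap-snapping heuristic (`snap_to_gaps`, with its `tolerance` parameter) are all assumptions.

```python
# Minimal sketch of SceneRAG-style segmentation (our illustration, not the
# authors' code). An LLM proposes scene boundaries from a timestamped ASR
# transcript; a heuristic then snaps each boundary to the nearest silence
# gap between utterances, approximating the paper's "iterative correction".
from dataclasses import dataclass

@dataclass
class Utterance:
    start: float  # seconds
    end: float
    text: str

def propose_boundaries(utterances, llm):
    """Ask any text-completion callable for scene-boundary timestamps."""
    transcript = "\n".join(f"[{u.start:.1f}-{u.end:.1f}] {u.text}" for u in utterances)
    prompt = ("Split this timestamped transcript into narrative-consistent "
              "scenes. Return one boundary timestamp (seconds) per line.\n\n"
              + transcript)
    reply = llm(prompt)
    return [float(line) for line in reply.splitlines() if line.strip()]

def snap_to_gaps(boundaries, utterances, tolerance=3.0):
    """Lightweight correction: move boundaries onto inter-utterance gaps."""
    gaps = [(a.end + b.start) / 2 for a, b in zip(utterances, utterances[1:])]
    if not gaps:
        return sorted(set(boundaries))
    snapped = []
    for t in boundaries:
        nearest = min(gaps, key=lambda g: abs(g - t))
        snapped.append(nearest if abs(nearest - t) <= tolerance else t)
    return sorted(set(snapped))
```

Downstream, each resulting scene would then be the unit for entity-relation extraction and knowledge-graph construction.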
Related papers
- Enhancing Long Video Question Answering with Scene-Localized Frame Grouping [19.83545369186771]
Current Multimodal Large Language Models (MLLMs) often perform poorly at long video understanding. We propose SceneQA, a new scenario within the video question-answering task, and introduce SLFG, a novel method that combines individual frames into semantically coherent scene frames.
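As a rough illustration of the frame-grouping idea (our own toy version; SLFG's actual algorithm and thresholds are not specified in the summary), consecutive frames whose embeddings stay similar can be merged into one scene frame:

```python
# Toy frame-grouping sketch (our reading of SLFG, not the authors' code):
# consecutive frames whose embeddings remain similar are merged into one
# "scene frame" by averaging their embeddings.
import numpy as np

def group_frames(frame_embs: np.ndarray, threshold: float = 0.85):
    """frame_embs: (num_frames, dim) L2-normalized embeddings."""
    if len(frame_embs) == 0:
        return []
    groups, current = [], [0]
    for i in range(1, len(frame_embs)):
        sim = float(frame_embs[i] @ frame_embs[current[-1]])
        if sim >= threshold:
            current.append(i)       # same scene: keep accumulating
        else:
            groups.append(current)  # similarity dropped: close the scene
            current = [i]
    groups.append(current)
    # One representative "scene frame" embedding per group.
    return [frame_embs[idx].mean(axis=0) for idx in groups]
```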
arXiv Detail & Related papers (2025-08-05T02:28:58Z)
- InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO [73.33751812982342]
InfLVG is an inference-time framework that enables coherent long video generation without requiring additional long-form video data. We show that InfLVG can extend video length by up to 9×, achieving strong consistency and semantic fidelity across scenes.
arXiv Detail & Related papers (2025-05-23T07:33:25Z)
- WikiVideo: Article Generation from Multiple Videos [67.59430517160065]
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple videos about real-world events. We introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims. We propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos.
arXiv Detail & Related papers (2025-04-01T16:22:15Z)
- VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos [25.770675590118547]
VideoRAG is the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Its core innovation is a dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features.
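A hedged sketch of what such a dual-channel retrieval step could look like (the graph schema, the `clips` node attribute, and the merge policy below are our assumptions, not VideoRAG's actual design):

```python
# Hedged dual-channel retrieval sketch (graph schema and merge policy are our
# assumptions): channel 1 walks a textual knowledge graph for query entities;
# channel 2 ranks clips by dense visual-embedding similarity.
import networkx as nx

def dual_channel_retrieve(query_entities, query_emb, kg: nx.Graph,
                          clip_embs: dict, top_k: int = 5):
    # Channel 1: graph grounding -- clips attached to query entities/neighbors.
    graph_hits = set()
    for e in query_entities:
        if e in kg:
            for node in [e, *kg.neighbors(e)]:
                graph_hits.update(kg.nodes[node].get("clips", []))
    # Channel 2: dense similarity over clip embeddings (numpy vectors).
    dense_hits = sorted(clip_embs, key=lambda c: -float(query_emb @ clip_embs[c]))[:top_k]
    # Merge: graph-grounded clips first, then remaining dense hits.
    return list(graph_hits) + [c for c in dense_hits if c not in graph_hits]
```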
arXiv Detail & Related papers (2025-02-03T17:30:19Z)
- Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation [16.80010133425332]
We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods.
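For intuition, here is one plausible reading of "segmented cross-attention" in PyTorch; the module name, shapes, and per-segment split are our guesses rather than Presto's published architecture:

```python
# One plausible reading of "segmented cross-attention" (module name, shapes,
# and the per-segment split are our guesses, not Presto's architecture):
# each temporal segment of video tokens attends only to its caption chunk.
import torch
import torch.nn as nn

class SegmentedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, caption_chunks):
        # video_tokens: (B, T, D); caption_chunks: list of S tensors (B, L_s, D)
        segs = video_tokens.chunk(len(caption_chunks), dim=1)
        outs = [self.attn(v, c, c)[0] for v, c in zip(segs, caption_chunks)]
        return torch.cat(outs, dim=1)  # (B, T, D), content varies per segment
```

Splitting the conditioning text this way would let later segments introduce content the earlier ones never mention, which matches the paper's stated goal of rich, progressing content.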
arXiv Detail & Related papers (2024-12-02T09:32:36Z)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level so that models can capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing precise identification and retrieval of relevant video segments in response to queries.
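A minimal sketch of segment retrieval and routing in this spirit (the `encode` function and the top-k routing policy are placeholders we chose, not SALOVA's implementation):

```python
# Minimal retrieval-and-routing sketch (the `encode` function and top-k
# policy are placeholders we chose, not SALOVA's implementation): captioned
# segments are scored against the query and only the best are routed onward.
import numpy as np

def route_segments(query, segments, encode, top_k=3):
    """segments: list of (start_sec, end_sec, caption); encode: text -> np vector."""
    q = encode(query)
    q = q / np.linalg.norm(q)
    scored = []
    for start, end, caption in segments:
        v = encode(caption)
        scored.append((float(q @ (v / np.linalg.norm(v))), start, end, caption))
    scored.sort(reverse=True)  # highest cosine similarity first
    return [(s, e, c) for _, s, e, c in scored[:top_k]]
```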
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that, as time progresses, generated videos should exhibit the emergence of new concepts and transitions among their relations, just as real-world videos do.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods that use video-level features.
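As a rough approximation of the idea (our reading, not the released code), a self-attention encoder can aggregate frame-level features into a single video-level descriptor, which would then be trained with a contrastive objective:

```python
# Approximation of the TCA idea (our reading, not the released code): a
# self-attention encoder aggregates frame-level features into one
# video-level descriptor; training would add a contrastive loss on top.
import torch
import torch.nn as nn

class TemporalContextAggregator(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) -> (B, D) by attending across all frames.
        return self.encoder(frame_feats).mean(dim=1)
```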
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.