Memory-enhanced Retrieval Augmentation for Long Video Understanding
- URL: http://arxiv.org/abs/2503.09149v1
- Date: Wed, 12 Mar 2025 08:23:32 GMT
- Title: Memory-enhanced Retrieval Augmentation for Long Video Understanding
- Authors: Huaying Yuan, Zheng Liu, Minhao Qin, Hongjin Qian, Y Shu, Zhicheng Dou, Ji-Rong Wen
- Abstract summary: We introduce a novel RAG-based LVU approach inspired by the cognitive memory of human beings, which is called MemVid. Our approach operates with four basic steps: memorizing holistic video information, reasoning about the task's information needs based on the memory, retrieving critical moments based on the information needs, and focusing on the retrieved moments to produce the final answer.
- Score: 57.371543819761555
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) shows strong potential in addressing long-video understanding (LVU) tasks. However, traditional RAG methods remain fundamentally limited due to their dependence on explicit search queries, which are unavailable in many situations. To overcome this challenge, we introduce a novel RAG-based LVU approach inspired by the cognitive memory of human beings, which is called MemVid. Our approach operates with four basic steps: memorizing holistic video information, reasoning about the task's information needs based on the memory, retrieving critical moments based on the information needs, and focusing on the retrieved moments to produce the final answer. To enhance the system's memory-grounded reasoning capabilities and achieve optimal end-to-end performance, we propose a curriculum learning strategy. This approach begins with supervised learning on well-annotated reasoning results, then progressively explores and reinforces more plausible reasoning outcomes through reinforcement learning. We perform extensive evaluations on popular LVU benchmarks, including MLVU, VideoMME and LVBench. In our experiments, MemVid significantly outperforms existing RAG-based methods and popular LVU models, which demonstrates the effectiveness of our approach. Our model and source code will be made publicly available upon acceptance.
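Read as a pipeline, the four steps compose into a simple control flow. Below is a minimal sketch of that loop under stated assumptions: every component (`memorize`, `reason`, `retrieve`, `answer`) is a hypothetical stand-in with toy logic, not the authors' released code, and a real system would back the reasoning step with the curriculum-trained (M)LLM the abstract describes.

```python
# Minimal sketch of the four-step MemVid-style loop described in the abstract:
# memorize -> reason -> retrieve -> focus/answer. All interfaces are hypothetical.

from dataclasses import dataclass

@dataclass
class Moment:
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds
    caption: str    # compact textual "memory" of the clip

def memorize(clips: list[str]) -> list[Moment]:
    """Step 1: build holistic memory (here: one toy caption per 10 s clip)."""
    return [Moment(i * 10.0, (i + 1) * 10.0, f"caption of {c}")
            for i, c in enumerate(clips)]

def reason(task: str, memory: list[Moment]) -> list[str]:
    """Step 2: infer the task's information needs from the memory.
    A real system would prompt an (M)LLM; we echo the task as a single need."""
    return [task]

def retrieve(needs: list[str], memory: list[Moment], k: int = 2) -> list[Moment]:
    """Step 3: rank moments against the needs (toy lexical overlap)."""
    overlap = lambda m, n: len(set(m.caption.split()) & set(n.lower().split()))
    return sorted(memory, key=lambda m: max(overlap(m, n) for n in needs),
                  reverse=True)[:k]

def answer(task: str, moments: list[Moment]) -> str:
    """Step 4: focus on the retrieved moments to produce the final answer."""
    return f"Answer to {task!r} grounded in: " + "; ".join(m.caption for m in moments)

memory = memorize([f"clip_{i}.mp4" for i in range(6)])
needs = reason("what happens at the end?", memory)
print(answer("what happens at the end?", retrieve(needs, memory)))
```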
Related papers
- MemInsight: Autonomous Memory Augmentation for LLM Agents [12.620141762922168]
We propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms.
We empirically validate the efficacy of our proposed approach in three task scenarios: conversational recommendation, question answering, and event summarization.
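As a rough illustration of attribute-augmented memory, the sketch below tags each memory item with structured attributes and retrieves by attribute match rather than raw text overlap; the keyword tagger stands in for the LLM-based annotation a system like MemInsight would use, and all names are hypothetical.

```python
# Toy illustration of attribute-augmented memory for retrieval; the trivial
# keyword tagger below is a placeholder for an LLM-based attribute annotator.

def extract_attributes(text: str) -> dict[str, str]:
    attrs = {}
    if "movie" in text: attrs["domain"] = "film"
    if "recommend" in text: attrs["intent"] = "recommendation"
    return attrs

memory: list[dict] = []

def memorize(text: str) -> None:
    memory.append({"text": text, "attrs": extract_attributes(text)})

def retrieve(query_attrs: dict[str, str]) -> list[str]:
    # Match on shared attribute key/value pairs rather than raw text overlap.
    return [m["text"] for m in memory
            if query_attrs.items() <= m["attrs"].items()]

memorize("User asked to recommend a movie like Inception.")
memorize("User mentioned a deadline on Friday.")
print(retrieve({"intent": "recommendation"}))
```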
arXiv Detail & Related papers (2025-03-27T17:57:28Z)
- Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs [0.0]
We propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training.
Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs.
We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models.
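A plausible shape for such augmented training data is sketched below: a sub-question/answer chain is serialized ahead of the final answer so the model learns to self-question implicitly. Field names and the template are illustrative assumptions, not the paper's released format.

```python
# Hypothetical shape of a VQA example augmented with a sub-question/answer
# reasoning chain for end-to-end training; all field names are illustrative.

vqa_example = {
    "image": "scene.jpg",
    "question": "Is the plant getting enough light?",
    "answer": "yes",
}

# Intermediate sub-questions and answers that decompose the task.
reasoning_chain = [
    ("Where is the plant located?", "on the windowsill"),
    ("Is the window letting in sunlight?", "yes"),
]

def build_training_target(example: dict, chain: list[tuple[str, str]]) -> str:
    """Serialize the chain before the final answer so generation passes
    through the sub-questions implicitly."""
    steps = "\n".join(f"Sub-Q: {q}\nSub-A: {a}" for q, a in chain)
    return f"{steps}\nFinal answer: {example['answer']}"

print(build_training_target(vqa_example, reasoning_chain))
```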
arXiv Detail & Related papers (2025-03-18T19:29:07Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.
Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.
Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
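The sketch below illustrates what an outcome-based reward with no process rewards can look like: the rollout is scored only on a search-invocation format check plus final-answer exact match. The exact shaping and tags are assumptions, not the paper's published reward.

```python
# Sketch of an outcome-only reward in the spirit of two-stage outcome-based RL:
# no per-step process reward, just a format incentive and a final-answer check.

def outcome_reward(rollout: str, gold_answer: str) -> float:
    reward = 0.0
    # Format incentive: the model must actually invoke search (hypothetical tags).
    if "<search>" in rollout and "</search>" in rollout:
        reward += 0.5
    # Outcome reward: exact match on the final answer only.
    final = rollout.rsplit("Answer:", 1)[-1].strip().lower()
    if final == gold_answer.strip().lower():
        reward += 1.0
    return reward

rollout = "<search>capital of France</search>\nAnswer: Paris"
print(outcome_reward(rollout, "Paris"))  # 1.5
```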
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos [62.01402470874109]
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.
It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval.
It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.
We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
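For a sense of how such a benchmark might be consumed, here is a hypothetical record schema and a Recall@1-style loop using temporal IoU; the field names and metric choice are illustrative assumptions, not MomentSeeker's release format.

```python
# Hypothetical long-video moment-retrieval record and a Recall@1 loop.

def iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Temporal intersection-over-union of two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

examples = [
    {"video": "v1.mp4", "task": "moment_search",
     "query": "the goal is scored", "gt_span": (512.0, 531.0)},
]

def recall_at_1(retriever, threshold: float = 0.5) -> float:
    """Counts an example as a hit if the top-ranked span overlaps ground truth."""
    hits = sum(iou(retriever(ex["video"], ex["query"])[0], ex["gt_span"]) >= threshold
               for ex in examples)
    return hits / len(examples)

# A dummy retriever that always returns one fixed span, just to run the loop.
print(recall_at_1(lambda video, query: [(510.0, 530.0)]))
```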
arXiv Detail & Related papers (2025-02-18T05:50:23Z)
- RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement [85.08223786819532]
Existing large language models (LLMs) show exceptional problem-solving capabilities but might struggle with complex reasoning tasks.
We propose RAG-Star, a novel RAG approach that integrates retrieved information to guide the tree-based deliberative reasoning process.
Our experiments involving Llama-3.1-8B-Instruct and GPT-4o demonstrate that RAG-Star significantly outperforms previous RAG and reasoning methods.
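A toy rendering of retrieval-augmented verification inside tree expansion is sketched below: candidate reasoning steps are scored against retrieved evidence and only the best-verified continuations survive. The corpus lookup and overlap scorer are placeholder assumptions, not RAG-Star's actual verifier.

```python
# Sketch of retrieval-augmented verification in a deliberative tree search.

def retrieve(claim: str) -> list[str]:
    # Placeholder corpus lookup; a real system queries an external index.
    corpus = ["Paris is the capital of France.", "The Seine flows through Paris."]
    return [doc for doc in corpus if set(claim.lower().split()) & set(doc.lower().split())]

def verify(step: str) -> float:
    """Score a candidate reasoning step by word overlap with retrieved evidence."""
    evidence = retrieve(step)
    return max((len(set(step.lower().split()) & set(d.lower().split())) / len(step.split())
                for d in evidence), default=0.0)

def expand(frontier: list[str], candidates: dict[str, list[str]], beam: int = 2) -> list[str]:
    """One expansion round: keep only the highest-verified continuations."""
    scored = [(verify(s), s) for node in frontier for s in candidates.get(node, [])]
    return [s for _, s in sorted(scored, reverse=True)[:beam]]

candidates = {"root": ["Paris is the capital of France", "London is the capital of France"]}
print(expand(["root"], candidates))
```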
arXiv Detail & Related papers (2024-12-17T13:05:36Z)
- AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant [23.366991558162695]
Large Language Models can generate factually incorrect information, known as "hallucination".
To cope with this challenge, we propose Assistant-based Retrieval-Augmented Generation (AssistRAG).
This assistant manages memory and knowledge through tool usage, action execution, memory building, and plan specification.
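The sketch below mirrors that description as a minimal control loop: a plan is specified, actions execute tools and build memory, and the accumulated memory grounds the final answer. Class and method names are hypothetical stand-ins, not AssistRAG's API.

```python
# Toy control loop for an assistant that manages memory and tools around an LLM.

class Assistant:
    def __init__(self):
        self.memory: list[str] = []

    def plan(self, question: str) -> list[str]:
        # Plan specification: decide which actions to take (fixed here for brevity).
        return ["retrieve", "answer"]

    def act(self, action: str, question: str) -> str | None:
        if action == "retrieve":
            note = f"retrieved evidence for: {question}"  # tool-usage stand-in
            self.memory.append(note)                      # memory building
            return None
        if action == "answer":
            context = " | ".join(self.memory)
            return f"LLM answer to {question!r} given [{context}]"
        return None

    def run(self, question: str) -> str:
        for action in self.plan(question):
            result = self.act(action, question)
            if result is not None:
                return result
        return "no answer"

print(Assistant().run("Who wrote Don Quixote?"))
```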
arXiv Detail & Related papers (2024-11-11T09:03:52Z)
- Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG [36.754491649652664]
Retrieval-augmented generation (RAG) empowers large language models (LLMs) to utilize external knowledge sources.
This paper investigates the detrimental impact of retrieved "hard negatives," identifying them as a key contributor to performance degradation.
To mitigate this and enhance the robustness of long-context LLM-based RAG, we propose both training-free and training-based approaches.
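As one illustrative training-free tactic consistent with this summary, the sketch below reorders retrieved passages so the highest-scoring ones sit at the two ends of the long context, pushing likely hard negatives toward the middle; this is a sketch of the general idea, not the paper's exact recipe.

```python
# Illustrative training-free mitigation: place the strongest passages at both
# ends of the long context, where LLMs attend most reliably.

def reorder_for_long_context(passages: list[str], scores: list[float]) -> list[str]:
    ranked = [p for _, p in sorted(zip(scores, passages), reverse=True)]
    front, back = [], []
    for i, p in enumerate(ranked):
        (front if i % 2 == 0 else back).append(p)  # alternate ends by rank
    return front + back[::-1]

passages = ["gold evidence", "hard negative A", "hard negative B", "weak match"]
scores = [0.9, 0.6, 0.5, 0.2]
print(reorder_for_long_context(passages, scores))
# -> ['gold evidence', 'hard negative B', 'weak match', 'hard negative A']
```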
arXiv Detail & Related papers (2024-10-08T12:30:07Z)
- Vision-Language Navigation with Continual Learning [10.850410419782424]
Vision-language navigation (VLN) is a critical domain within embodied intelligence.
We propose the Vision-Language Navigation with Continual Learning paradigm to address this challenge.
In this paradigm, agents incrementally learn new environments while retaining previously acquired knowledge.
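A minimal rehearsal-style sketch of that paradigm appears below: training batches for a new environment mix in replayed episodes from earlier ones so prior knowledge is retained. The buffer policy is an illustrative assumption, not the paper's method.

```python
# Minimal rehearsal sketch for incremental environment learning with retention.

import random

replay_buffer: list[dict] = []

def train_on_environment(env_episodes: list[dict], replay_ratio: float = 0.3) -> list[dict]:
    k = int(len(env_episodes) * replay_ratio)
    rehearsal = random.sample(replay_buffer, min(k, len(replay_buffer)))
    batch = env_episodes + rehearsal      # new data plus replayed old data
    replay_buffer.extend(env_episodes)    # keep new episodes for future rehearsal
    return batch                          # a real agent would update on this batch

env1 = [{"env": "house_1", "instr": "go to the kitchen"}]
env2 = [{"env": "house_2", "instr": "walk upstairs"}]
train_on_environment(env1)
print(train_on_environment(env2))
```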
arXiv Detail & Related papers (2024-09-04T09:28:48Z)
- Improve Temporal Awareness of LLMs for Sequential Recommendation [61.723928508200196]
Large language models (LLMs) have demonstrated impressive zero-shot abilities in solving a wide range of general-purpose tasks.
However, LLMs fall short in recognizing and utilizing temporal information, leading to poor performance on tasks that require an understanding of sequential data.
We propose three prompting strategies to exploit temporal information within historical interactions for LLM-based sequential recommendation.
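One such strategy, sketched below under assumptions, is to make timestamps and ordering explicit in the prompt so the LLM can exploit recency and spacing; the template is illustrative, not the paper's exact prompt.

```python
# Sketch of a temporally explicit prompt for LLM-based sequential recommendation.

from datetime import date

history = [
    (date(2024, 1, 3), "wireless mouse"),
    (date(2024, 1, 20), "mechanical keyboard"),
    (date(2024, 2, 11), "USB-C hub"),
]

def temporal_prompt(history: list[tuple[date, str]]) -> str:
    # Surface timestamps and order explicitly instead of a bare item list.
    lines = [f"{i + 1}. [{d.isoformat()}] {item}"
             for i, (d, item) in enumerate(sorted(history))]
    return ("The user bought, in chronological order:\n" + "\n".join(lines) +
            "\nGiven the recency and spacing of these purchases, "
            "recommend the next item.")

print(temporal_prompt(history))
```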
arXiv Detail & Related papers (2024-05-05T00:21:26Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
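The sketch below shows one way LLM guidance can enter a value-based update as a regularizer: the bootstrapped target is blended toward a value estimate under an LLM-suggested action distribution. The blend is a toy illustration of the idea, not LINVIT's actual objective.

```python
# Toy value-based update with LLM guidance as a regularization term.

Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0}

def llm_prior(state: str) -> dict[str, float]:
    # Hypothetical LLM hint: it believes "right" is usually correct here.
    return {"left": 0.2, "right": 0.8}

def regularized_update(state: str, action: str, reward: float,
                       alpha: float = 0.5, lam: float = 0.3) -> None:
    greedy = max(Q[(state, a)] for a in ("left", "right"))
    # Blend the bootstrapped target toward the prior-weighted value estimate.
    prior = sum(llm_prior(state)[a] * Q[(state, a)] for a in ("left", "right"))
    target = reward + (1 - lam) * greedy + lam * prior
    Q[(state, action)] += alpha * (target - Q[(state, action)])

regularized_update("s0", "right", reward=1.0)
print(Q)
```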
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
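The two-component pipeline reads naturally as a prompt builder, sketched below with hypothetical stubs: a recognizer summarizes observed actions, a vision-language model captions the scene, and the combined prompt would go to an LLM for anticipation.

```python
# Sketch of the described pipeline: action recognizer + VLM feed an LLM prompt.

def recognize_actions(frames: list[str]) -> list[str]:
    return ["crack egg", "whisk egg"]  # stand-in recognizer output

def describe_scene(frames: list[str]) -> str:
    return "a kitchen counter with a pan on the stove"  # stand-in VLM caption

def anticipate(frames: list[str]) -> str:
    actions = ", ".join(recognize_actions(frames))
    prompt = (f"Observed actions so far: {actions}. "
              f"Scene: {describe_scene(frames)}. "
              f"Predict the next likely actions.")
    return prompt  # a real system would send this prompt to an LLM

print(anticipate(["frame_000.jpg", "frame_001.jpg"]))
```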
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- Class-Incremental Continual Learning into the eXtended DER-verse [17.90483695137098]
This work aims at assessing and overcoming the pitfalls of our previous proposal, Dark Experience Replay (DER).
Inspired by the way our minds constantly rewrite past recollections and set expectations for the future, we endow our model with the ability to revise its replay memory to welcome novel information regarding past data.
We show that the application of these strategies leads to remarkable improvements.
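For context, a DER-style objective replays past samples together with their stored logits; the sketch below shows that core loss (classification on new data plus logit matching on the buffer), with the memory-revision abilities described above omitted for brevity. Shapes and the hyperparameter are illustrative.

```python
# Minimal sketch of a Dark-Experience-Replay-style objective: cross-entropy on
# the current task plus MSE between current and stored logits on buffered data.

import torch
import torch.nn.functional as F

def der_loss(model, x_new, y_new, x_buf, logits_buf, alpha: float = 0.5):
    loss = F.cross_entropy(model(x_new), y_new)           # current-task loss
    loss += alpha * F.mse_loss(model(x_buf), logits_buf)  # logit replay on old data
    return loss

model = torch.nn.Linear(8, 4)  # toy classifier
x_new, y_new = torch.randn(16, 8), torch.randint(0, 4, (16,))
x_buf, logits_buf = torch.randn(16, 8), torch.randn(16, 4)
print(der_loss(model, x_new, y_new, x_buf, logits_buf).item())
```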
arXiv Detail & Related papers (2022-01-03T17:14:30Z)