LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos
- URL: http://arxiv.org/abs/2312.05269v3
- Date: Tue, 05 Nov 2024 22:08:14 GMT
- Title: LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos
- Authors: Ying Wang, Yanlai Yang, Mengye Ren
- Abstract summary: LifelongMemory is a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval.
Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D.
- Score: 15.127197238628396
- License:
- Abstract: In this paper we introduce LifelongMemory, a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval. LifelongMemory generates concise video activity descriptions of the camera wearer and leverages the zero-shot capabilities of pretrained large language models to perform reasoning over long-form video context. Furthermore, LifelongMemory uses a confidence and explanation module to produce confident, high-quality, and interpretable answers. Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D. Code is available at https://github.com/agentic-learning-ai-lab/lifelong-memory.
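The abstract describes a caption-then-reason pipeline: concise activity captions are produced for the video, a pretrained LLM reasons over the concatenated captions zero-shot, and a confidence and explanation module filters the final answer. The Python sketch below illustrates that general pattern under our own assumptions; the function names, prompt wording, JSON output schema, and 0.5 confidence threshold are illustrative and are not taken from the paper or its released code.

```python
# Minimal sketch (not the authors' implementation) of a caption-then-reason
# pipeline: per-clip captions of an egocentric video are concatenated into a
# text context, an LLM answers a question over it, and low-confidence answers
# are flagged, mirroring the confidence/explanation idea in the abstract.
import json
from typing import Callable, List


def build_prompt(captions: List[str], question: str) -> str:
    """Concatenate per-clip activity captions into one textual context."""
    context = "\n".join(f"[clip {i:04d}] {c}" for i, c in enumerate(captions))
    return (
        "You are given activity captions from a long egocentric video.\n"
        + context
        + f"\n\nQuestion: {question}\n"
        + 'Reply as JSON: {"answer": ..., "confidence": 0-1, "explanation": ...}'
    )


def answer_query(
    captions: List[str],
    question: str,
    llm: Callable[[str], str],  # any text-in/text-out LLM call (API wrapper, local model, ...)
    threshold: float = 0.5,     # assumed cut-off, not from the paper
) -> dict:
    """Zero-shot reasoning over the caption context; flag low-confidence answers."""
    result = json.loads(llm(build_prompt(captions, question)))
    if result.get("confidence", 0.0) < threshold:
        result["answer"] = None  # defer instead of returning an unreliable answer
    return result
```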
Related papers
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- MemLong: Memory-Augmented Retrieval for Long Text Modeling [37.49036666949963]
This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation.
MemLong combines a non-differentiable "ret-mem" module with a partially trainable decoder-only language model.
Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs.
arXiv Detail & Related papers (2024-08-30T02:01:56Z)
- Needle in the Haystack for Memory Based Large Language Models [31.885539843977472]
Current large language models (LLMs) often perform poorly on simple fact retrieval tasks.
We investigate whether coupling a dynamically adaptable external memory to an LLM can alleviate this problem.
We demonstrate that the external memory of Larimar, which allows fast write and read of an episode of text samples, can be used at test time to handle contexts much longer than those seen during training.
arXiv Detail & Related papers (2024-07-01T16:32:16Z)
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands videos of arbitrary length using a constant number of video streaming tokens that are encoded and selected as they propagate through the model.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model achieves state-of-the-art performance across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
- Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models [75.98775135321355]
Given a long conversation, large language models (LLMs) fail to recall past information and tend to generate inconsistent responses.
We propose to recursively generate summaries/memory using large language models (LLMs) to enhance their long-term memory ability.
arXiv Detail & Related papers (2023-08-29T04:59:53Z)
- Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception [19.627636189321393]
A promising avenue for memory augmentation is through the use of augmented reality head-mounted displays to capture and preserve egocentric videos.
However, current technology lacks the capability to encode and store such large amounts of data efficiently.
We propose a memory augmentation agent that leverages natural language encoding for video data and stores the encodings in a vector database.
arXiv Detail & Related papers (2023-08-10T18:43:44Z)
- Augmenting Language Models with Long-Term Memory [142.04940250657637]
Existing large language models (LLMs) can only process fixed-size inputs due to the input length limit.
We propose a framework, Language Models Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long history.
arXiv Detail & Related papers (2023-06-12T15:13:39Z)
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets.
Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
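To make the triplet-style write-read memory described in the RET-LLM entry above more concrete, here is a small hypothetical sketch; the class name, exact matching rule, and example facts are assumptions for illustration and do not reproduce the paper's implementation.

```python
# Hypothetical sketch of a (subject, relation, object) write-read memory,
# loosely following the RET-LLM summary above; not the paper's code.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class TripletMemory:
    store: List[Triplet] = field(default_factory=list)

    def write(self, subj: str, rel: str, obj: str) -> None:
        """Save one extracted fact as a triplet."""
        self.store.append((subj, rel, obj))

    def read(self, subj: str, rel: str) -> Optional[str]:
        """Return the most recently written object matching (subject, relation)."""
        for s, r, o in reversed(self.store):
            if s == subj and r == rel:
                return o
        return None


memory = TripletMemory()
memory.write("Alice", "works_at", "Acme")
print(memory.read("Alice", "works_at"))  # -> "Acme"
```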
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.