Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception
- URL: http://arxiv.org/abs/2308.05822v3
- Date: Fri, 18 Oct 2024 07:24:54 GMT
- Title: Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception
- Authors: Junxiao Shen, John Dudley, Per Ola Kristensson
- Abstract summary: A promising avenue for memory augmentation is through the use of augmented reality head-mounted displays to capture and preserve egocentric videos.
The current technology lacks the capability to encode and store such large amounts of data efficiently.
We propose a memory augmentation agent that involves leveraging natural language encoding for video data and storing them in a vector database.
- Score: 19.627636189321393
- Abstract: We depend on our own memory to encode, store, and retrieve our experiences. However, memory lapses can occur. One promising avenue for achieving memory augmentation is through the use of augmented reality head-mounted displays to capture and preserve egocentric videos, a practice commonly referred to as lifelogging. However, a significant challenge arises from the sheer volume of video data generated through lifelogging, as the current technology lacks the capability to encode and store such large amounts of data efficiently. Further, retrieving specific information from extensive video archives requires substantial computational power, further complicating the task of quickly accessing desired content. To address these challenges, we propose a memory augmentation agent that involves leveraging natural language encoding for video data and storing them in a vector database. This approach harnesses the power of large vision language models to perform the language encoding process. Additionally, we propose using large language models to facilitate natural language querying. Our agent underwent extensive evaluation using the QA-Ego4D dataset and achieved state-of-the-art results with a BLEU score of 8.3, outperforming conventional machine learning models that scored between 3.4 and 5.8. Additionally, we conducted a user study in which participants interacted with the human memory augmentation agent through episodic memory and open-ended questions. The results of this study show that the agent results in significantly better recall performance on episodic memory tasks compared to human participants. The results also highlight the agent's practical applicability and user acceptance.
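The encode-store-retrieve loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: the hand-written captions stand in for the output of a vision language model, a toy bag-of-words vector stands in for a learned text embedding, and the hypothetical `MemoryStore` class stands in for a vector database. The final step in the abstract, where a large language model answers the query from the retrieved text, is omitted.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a learned text encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class MemoryStore:
    """Hypothetical in-memory substitute for a vector database."""

    def __init__(self):
        self.entries = []  # list of (timestamp, caption, vector)

    def encode_and_store(self, timestamp, caption):
        # Encode: the caption is the language encoding of a video segment.
        # Store: keep the caption together with its vector.
        self.entries.append((timestamp, caption, embed(caption)))

    def retrieve(self, query, k=1):
        # Retrieve: rank stored captions by similarity to the query.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(ts, cap) for ts, cap, _ in ranked[:k]]


store = MemoryStore()
# Captions are invented for illustration; in the paper's setting they would
# come from a vision language model run over egocentric video.
store.encode_and_store("09:02", "I put my keys on the kitchen counter")
store.encode_and_store("09:15", "I am reading a book on the sofa")
store.encode_and_store("12:30", "I am eating lunch with a colleague in the cafeteria")

top = store.retrieve("Where did I put my keys?", k=1)
print(top)
```

Storing short captions rather than raw frames is what keeps the archive small and makes text-based similarity search possible. A real system would replace `embed` with a learned embedding model and `MemoryStore` with an actual vector database.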
Related papers
- VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers, rather than by decreasing the number of vision tokens.
Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval [9.899703354116962]
Dense video captioning aims to automatically localize and caption all events within untrimmed video.
We propose a novel framework inspired by the cognitive information processing of humans.
Our model utilizes external memory to incorporate prior knowledge.
arXiv Detail & Related papers (2024-04-11T09:58:23Z)
- Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations [39.05338079159942]
This study introduces a novel framework, COmpressive Memory-Enhanced Dialogue sYstems (COMEDY), which eschews traditional retrieval modules and memory databases.
Central to COMEDY is the concept of compressive memory, which integrates session-specific summaries, user-bot dynamics, and past events into a concise memory format.
arXiv Detail & Related papers (2024-02-19T09:19:50Z)
- Personalized Large Language Model Assistant with Evolving Conditional Memory [15.780762727225122]
We present a plug-and-play framework that could facilitate personalized large language model assistants with evolving conditional memory.
The personalized assistant focuses on intelligently preserving the knowledge and experience from the history dialogue with the user.
arXiv Detail & Related papers (2023-12-22T02:39:15Z)
- LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos [15.127197238628396]
LifelongMemory is a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval.
Our approach achieves state-of-the-art performance on the benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D.
arXiv Detail & Related papers (2023-12-07T19:19:25Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets.
Our framework exhibits robust performance in handling temporal-based question answering tasks.
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory [119.98011559193574]
We propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL).
It learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries.
A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data.
arXiv Detail & Related papers (2022-12-10T06:17:56Z)
- LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo) that enhances the recurrence memory by incrementally attending to the right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.