HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning
- URL: http://arxiv.org/abs/2412.14585v1
- Date: Thu, 19 Dec 2024 07:06:25 GMT
- Title: HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning
- Authors: Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
- Abstract summary: Interest in dense video captioning (DVC) has been on the rise.
Several studies highlight the challenges of utilizing prior knowledge, such as pre-training and external memory.
We propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory.
- Score: 9.899703354116962
- Abstract: With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets.
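The abstract describes building the compact memory by clustering memory events and summarizing each cluster with a large language model, level by level. Below is a minimal sketch of that loop, assuming a sentence encoder (`embed`) and an LLM summarization call (`summarize`) that the abstract does not specify; the function names, branching factor, and k-means choice are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch: hierarchical compact memory via clustering + summarization.
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    # Stand-in for a real sentence encoder (e.g. an SBERT/CLIP text model).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 64))

def summarize(captions):
    # Stand-in for an LLM summarization call.
    return " / ".join(captions)[:80]

def build_hierarchy(event_captions, branching=4, levels=2):
    """Cluster event captions and summarize each cluster, repeatedly,
    forming coarser and more compact memory levels above the raw events."""
    hierarchy = [event_captions]
    texts = event_captions
    for _ in range(levels):
        k = max(1, len(texts) // branching)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embed(texts))
        texts = [summarize([t for t, l in zip(texts, labels) if l == c])
                 for c in range(k)]
        hierarchy.append(texts)
    return hierarchy  # hierarchy[0] = events, hierarchy[-1] = coarsest summaries
```

A hierarchical reading module would then retrieve top-down through these levels, mimicking coarse-to-fine memory recall.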
Related papers
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing [33.720656946186885]
Hierarchical Memory Transformer (HMT) is a novel framework that strengthens a model's ability to process long contexts.
HMT consistently improves the long-context processing of existing models.
arXiv Detail & Related papers (2024-05-09T19:32:49Z)
- Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition [62.85802939587308]
This paper focuses on exploring Class Incremental Audio-Visual Video Recognition (CIAVVR).
Since both the stored data and the learned models of past classes contain historical knowledge, the core challenge is how to capture both past data knowledge and past model knowledge to prevent catastrophic forgetting.
We introduce Hierarchical Augmentation and Distillation (HAD), which comprises the Hierarchical Augmentation Module (HAM) and Hierarchical Distillation Module (HDM) to efficiently utilize the hierarchical structure of data and models.
arXiv Detail & Related papers (2024-01-11T23:00:24Z)
- Empowering Working Memory for Large Language Model Agents [9.83467478231344]
This paper explores the potential of applying cognitive psychology's working memory frameworks to large language models (LLMs).
An innovative model is proposed incorporating a centralized Working Memory Hub and Episodic Buffer access to retain memories across episodes.
This architecture aims to provide greater continuity for nuanced contextual reasoning during intricate tasks and collaborative scenarios.
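As a rough illustration of the described architecture, here is a hedged sketch of a centralized working-memory hub with an episodic buffer; the class and method names are hypothetical, not the paper's actual interface.

```python
# Toy sketch of a working-memory hub that consolidates context into an
# episodic buffer, so later episodes can recall it (names are assumptions).
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    observations: list
    outcome: str

@dataclass
class WorkingMemoryHub:
    capacity: int = 8                      # active-context slots
    active: deque = field(default_factory=deque)
    episodic_buffer: list = field(default_factory=list)

    def attend(self, item):
        """Keep only the most recent `capacity` items in active focus."""
        self.active.append(item)
        while len(self.active) > self.capacity:
            self.active.popleft()

    def end_episode(self, task, outcome):
        """Consolidate the current context into the episodic buffer."""
        self.episodic_buffer.append(Episode(task, list(self.active), outcome))
        self.active.clear()

    def recall(self, task):
        return [e for e in self.episodic_buffer if e.task == task]
```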
arXiv Detail & Related papers (2023-12-22T05:59:00Z)
- Encode-Store-Retrieve: Augmenting Human Memory through Language-Encoded Egocentric Perception [19.627636189321393]
A promising avenue for memory augmentation is through the use of augmented reality head-mounted displays to capture and preserve egocentric videos.
The current technology lacks the capability to encode and store such large amounts of data efficiently.
We propose a memory augmentation agent that encodes video data in natural language and stores the encodings in a vector database.
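An illustrative sketch of that encode-store-retrieve loop follows: captions stand in for language-encoded video segments, and a brute-force cosine search stands in for a real vector database; both substitutions are assumptions, as is the random `embed` placeholder.

```python
# Sketch: store language-encoded segments as vectors, retrieve by similarity.
import numpy as np

def embed(text):
    # Placeholder for a real text encoder; random unit vectors only.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

class VectorMemory:
    def __init__(self):
        self.keys, self.values = [], []

    def store(self, caption, timestamp):
        self.keys.append(embed(caption))        # encode: language -> vector
        self.values.append((caption, timestamp))

    def retrieve(self, query, k=3):
        sims = np.stack(self.keys) @ embed(query)  # cosine (unit vectors)
        return [self.values[i] for i in np.argsort(-sims)[:k]]

mem = VectorMemory()
mem.store("put keys on the kitchen counter", "09:14")
print(mem.retrieve("where are my keys?"))
```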
arXiv Detail & Related papers (2023-08-10T18:43:44Z)
- Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
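One plausible reading of frame-level diversity under a tight memory budget is greedy farthest-point selection over frame features, sketched below; the selection rule and feature dimensions are assumptions, not the paper's exact method.

```python
# Sketch: pick a small, diverse set of single frames for the replay buffer.
import numpy as np

def select_diverse_frames(features, budget):
    """Greedy farthest-point selection of `budget` frame features."""
    chosen = [0]
    dists = np.linalg.norm(features - features[0], axis=1)
    while len(chosen) < budget:
        nxt = int(np.argmax(dists))            # frame farthest from chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return chosen

frames = np.random.default_rng(0).normal(size=(120, 16))  # 120 frame features
replay_indices = select_diverse_frames(frames, budget=4)
```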
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
- Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracies on the ImageNet-LT, Places-LT, and WebVision datasets.
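A hedged sketch of such an attention read over retrieved examples, using single-head dot-product attention; the shapes and formulation are assumptions made for illustration.

```python
# Sketch: the query attends to K retrieved memory entries, so irrelevant
# retrieved examples receive low weight in the readout.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_to_memory(query, retrieved_keys, retrieved_values):
    """query: (d,); retrieved_keys/values: (K, d).
    Returns a readout that down-weights irrelevant retrieved examples."""
    d = query.shape[0]
    weights = softmax(retrieved_keys @ query / np.sqrt(d))  # (K,) importances
    return weights @ retrieved_values, weights

rng = np.random.default_rng(0)
q = rng.normal(size=16)
keys, vals = rng.normal(size=(5, 16)), rng.normal(size=(5, 16))
readout, w = attend_to_memory(q, keys, vals)
```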
arXiv Detail & Related papers (2023-04-11T12:12:05Z)
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores.
We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores.
XMem greatly exceeds state-of-the-art performance on long-video datasets.
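A toy, non-authoritative sketch of an Atkinson-Shiffrin-style store layout (sensory, working, and long-term memory) as the abstract suggests; averaging-based consolidation is an assumption for illustration, not XMem's actual mechanism.

```python
# Sketch: a small working memory periodically consolidated into a compact
# long-term store, keeping total memory bounded for long videos.
import numpy as np

class ThreeStoreMemory:
    def __init__(self, work_capacity=10):
        self.sensory = None            # most recent frame feature only
        self.working = []              # recent frame features
        self.long_term = []            # compact consolidated prototypes
        self.work_capacity = work_capacity

    def observe(self, feature):
        self.sensory = feature
        self.working.append(feature)
        if len(self.working) > self.work_capacity:
            self.consolidate()

    def consolidate(self):
        # Compress working memory into one long-term prototype.
        self.long_term.append(np.mean(self.working, axis=0))
        self.working = self.working[-1:]   # keep only the freshest entry
```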
arXiv Detail & Related papers (2022-07-14T17:59:37Z)
- A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model that continually learns new classes under a limited memory budget.
We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models does not consistently work.
We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
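A toy budget calculation illustrating the aligned-memory comparison the abstract argues for; all sizes below are made-up illustrative numbers, apart from the 603 exemplars echoed from the title.

```python
# Sketch: compare methods by exemplars + model parameters, not exemplars alone.
def total_budget_mb(n_exemplars, exemplar_mb, n_models, params_millions):
    model_mb = params_millions * 4          # fp32: 4 bytes per parameter
    return n_exemplars * exemplar_mb + n_models * model_mb

# Exemplar-only method: 2000 images, one 11M-parameter backbone.
a = total_budget_mb(2000, 0.15, n_models=1, params_millions=11)
# Model-saving method: 603 exemplars, but one backbone per task (5 tasks).
b = total_budget_mb(603, 0.15, n_models=5, params_millions=11)
print(f"{a:.0f} MB vs {b:.0f} MB")          # aligned comparison, not free models
```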
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
- Hierarchical Memory Matching Network for Video Object Segmentation [38.24999776705497]
We propose two advanced memory read modules that enable memory reading at multiple scales while exploiting temporal smoothness.
We first propose a guided memory matching module that replaces the non-local dense memory read commonly adopted in previous memory-based methods.
We introduce a hierarchical memory matching scheme and propose a top-k guided memory matching module in which memory reading at the fine scale is guided by that at the coarse scale.
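A hedged sketch of that top-k guidance: coarse-scale similarities pick the k most relevant memory frames, and fine-scale matching is restricted to those. Shapes and the dot-product similarity are assumptions.

```python
# Sketch: coarse-to-fine memory read, with fine matching guided by coarse top-k.
import numpy as np

def topk_guided_match(q_coarse, q_fine, m_coarse, m_fine, k=2):
    """q_*: query features; m_*: memory features of shape (N, d_*)."""
    coarse_sim = m_coarse @ q_coarse          # (N,) coarse-scale scores
    top = np.argsort(-coarse_sim)[:k]         # guide: keep only top-k frames
    fine_sim = m_fine[top] @ q_fine           # fine-scale read on top-k only
    return top, fine_sim

rng = np.random.default_rng(0)
top, scores = topk_guided_match(rng.normal(size=8), rng.normal(size=32),
                                rng.normal(size=(6, 8)), rng.normal(size=(6, 32)))
```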
arXiv Detail & Related papers (2021-09-23T14:36:43Z)
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning [128.36951818335046]
We propose a new approach called Memory-Augmented Recurrent Transformer (MART).
MART uses a memory module to augment the transformer architecture.
MART generates more coherent and less repetitive paragraph captions than baseline methods.
arXiv Detail & Related papers (2020-05-11T20:01:41Z)