HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
- URL: http://arxiv.org/abs/2408.17443v3
- Date: Sat, 09 Nov 2024 06:46:41 GMT
- Title: HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
- Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu
- Abstract summary: HERMES is a model that simulates episodic memory accumulation to capture action sequences.
The Episodic COmpressor (ECO) efficiently aggregates crucial representations from micro to semi-macro levels.
The Semantics ReTRiever (SeTR) dramatically reduces feature dimensionality while preserving relevant macro-level information.
- Score: 32.117677036812836
- Abstract: Existing research often treats long-form videos as extended short videos, leading to several limitations: inadequate capture of long-range dependencies, inefficient processing of redundant information, and failure to extract high-level semantic concepts. To address these issues, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels, overcoming the challenge of long-range dependencies. Second, we propose a Semantics ReTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. This addresses the issues of redundancy and lack of high-level concept extraction. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
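The abstract describes ECO only at a high level and does not state its aggregation rule. As a rough illustration of what aggregating frame-level features into a fixed-size episodic memory could look like, the Python sketch below repeatedly merges the most similar adjacent pair of feature vectors until a capacity is met. The function names (`cosine`, `compress_episodes`), the cosine-similarity criterion, and the weighted-mean merge are all assumptions made for this example, not the method confirmed by the abstract, and SeTR is not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_episodes(frames, capacity):
    """Merge the most similar adjacent pair until at most `capacity` entries remain.

    Hypothetical helper: the merge rule (count-weighted mean of adjacent
    features) is an assumption for illustration, not the rule used by HERMES.
    Returns a list of (feature, number_of_frames_merged) tuples.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    # Each memory entry tracks its feature and how many frames it summarizes.
    memory = [(list(f), 1) for f in frames]
    while len(memory) > capacity:
        # Index of the adjacent pair with the highest similarity
        # (ties resolve to the earliest pair).
        i = max(
            range(len(memory) - 1),
            key=lambda k: cosine(memory[k][0], memory[k + 1][0]),
        )
        (a, na), (b, nb) = memory[i], memory[i + 1]
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(a, b)]
        memory[i:i + 2] = [(merged, na + nb)]
    return memory


if __name__ == "__main__":
    # Toy 2-D "frame features": two near-identical scenes of two frames each.
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(compress_episodes(toy, capacity=2))
```

Merging only adjacent entries keeps the temporal order of the compressed sequence, which is one simple way to respect the "action sequence" framing in the abstract; whether the actual model restricts merges this way is not stated here.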
Related papers
- Position: Episodic Memory is the Missing Piece for Long-Term LLM Agents [43.94686139164999]
We present an episodic memory framework for Large Language Model (LLM) agents, centered around five key properties of episodic memory.
This position paper argues that now is the right time for an explicit, integrated focus on episodic memory to catalyze the development of long-term agents.
arXiv Detail & Related papers (2025-02-10T19:14:51Z)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input.
This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
- SEAL: Semantic Attention Learning for Long Video Representation [31.994155533019843]
This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos.
To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities.
Our representation is versatile, enabling applications across various long video understanding tasks.
arXiv Detail & Related papers (2024-12-02T18:46:12Z)
- SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding [50.337896542603524]
We introduce SpikeMba: a multi-modal spiking saliency mamba for temporal video grounding.
Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages.
Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-04-01T15:26:44Z)
- Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models [20.33971942003996]
This study introduces an innovative method to address event-level hallucinations in MLLMs.
We propose a unique mechanism that decomposes on-demand event queries into iconic actions.
We employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences.
arXiv Detail & Related papers (2024-01-18T10:18:48Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Hierarchical Deep Residual Reasoning for Temporal Moment Localization [48.108468456043994]
We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics.
We also design simple yet effective Res-BiGRUs for feature fusion, which are able to capture useful information in a self-adapting manner.
arXiv Detail & Related papers (2021-10-31T07:13:34Z)
- Interpretable Time-series Representation Learning With Multi-Level Disentanglement [56.38489708031278]
Disentangle Time Series (DTS) is a novel disentanglement enhancement framework for sequential data.
DTS generates hierarchical semantic concepts as the interpretable and disentangled representation of time-series.
DTS achieves superior performance in downstream applications, with high interpretability of semantic concepts.
arXiv Detail & Related papers (2021-05-17T22:02:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.