LET-US: Long Event-Text Understanding of Scenes
- URL: http://arxiv.org/abs/2508.07401v1
- Date: Sun, 10 Aug 2025 16:02:41 GMT
- Title: LET-US: Long Event-Text Understanding of Scenes
- Authors: Rui Chen, Xingyu Chen, Shaoan Wang, Shihan Kong, Junzhi Yu
- Abstract summary: Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution. We introduce LET-US, a framework for long event-stream--text comprehension. We use an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details.
- Score: 23.376693904132786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream--text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively equips our model with the capacity to interpret event-based scenes. To handle the voluminous temporal information inherent in long event streams, we leverage text-guided cross-modal queries for feature reduction, augmented by hierarchical clustering and similarity computation to distill the most representative event features. Moreover, we curate and construct a large-scale event-text aligned dataset to train our model, achieving tighter alignment of event features within the LLM embedding space. We also develop a comprehensive benchmark covering a diverse set of tasks -- reasoning, captioning, classification, temporal localization, and moment retrieval. Experimental results demonstrate that LET-US outperforms prior state-of-the-art MLLMs in both descriptive accuracy and semantic comprehension on long-duration event streams. All datasets, code, and models will be publicly available.
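A minimal sketch of the feature-reduction idea described in the abstract: text-guided cross-modal queries compress a long stream of event features, and hierarchical clustering then keeps one representative token per cluster. The module names, dimensions, query count, and the use of PyTorch and scikit-learn are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: text-guided query reduction over event features, then hierarchical
# clustering to keep representative tokens. Shapes and layer choices are assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import AgglomerativeClustering

class TextGuidedReducer(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)   # project text features into query space
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_queries = num_queries

    def forward(self, event_tokens, text_tokens):
        # event_tokens: (B, N, D) features from a long event stream
        # text_tokens:  (B, T, D) features of the user instruction
        queries = self.text_proj(text_tokens[:, : self.num_queries])   # (B, Q, D)
        reduced, _ = self.cross_attn(queries, event_tokens, event_tokens)
        return reduced                                                  # (B, Q, D)

def select_representative(tokens, num_clusters=8):
    # Hierarchical clustering over token features; keep, per cluster, the token
    # closest to the cluster mean as its representative.
    labels = AgglomerativeClustering(n_clusters=num_clusters).fit_predict(tokens)
    keep = []
    for c in range(num_clusters):
        members = (labels == c).nonzero()[0]
        center = tokens[members].mean(axis=0)
        dists = ((tokens[members] - center) ** 2).sum(axis=1)
        keep.append(members[dists.argmin()])
    return sorted(keep)

# Usage on random data
reducer = TextGuidedReducer()
events = torch.randn(1, 4096, 768)    # long event stream -> 4096 tokens
text = torch.randn(1, 32, 768)        # instruction tokens
compressed = reducer(events, text)    # (1, 32, 768)
idx = select_representative(compressed[0].detach().numpy(), num_clusters=8)
print(compressed.shape, idx)
```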
Related papers
- EventFlash: Towards Efficient MLLMs for Event-Based Vision [55.65520031675231]
Event-based multimodal language models (LMLMs) enable robust perception in high-speed and low-light scenarios. We build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets. We present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens. We believe EventFlash serves as an efficient foundation model for event-based vision.
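A minimal sketch of what adaptive temporal window aggregation could look like: temporal tokens are mean-pooled over windows whose boundaries follow event activity, so busy segments keep finer resolution. The event-count heuristic and token budget are assumptions for illustration, not EventFlash's actual module.

```python
# Sketch: pool temporal tokens over activity-adaptive windows (assumed heuristic).
import numpy as np

def adaptive_window_pool(tokens, event_counts, budget=16):
    # tokens: (T, D) per-timestep features; event_counts: (T,) activity proxy.
    # Split the cumulative activity into `budget` roughly equal-mass windows,
    # then mean-pool the tokens inside each window.
    cum = np.cumsum(event_counts) / max(event_counts.sum(), 1)
    edges = np.searchsorted(cum, np.linspace(0.0, 1.0, budget + 1)[1:-1])
    windows = np.split(np.arange(len(tokens)), edges)
    pooled = [tokens[w].mean(axis=0) for w in windows if len(w) > 0]
    return np.stack(pooled)   # (<= budget, D) compressed temporal tokens

tokens = np.random.randn(512, 256)          # 512 temporal tokens of dim 256
counts = np.random.poisson(5.0, size=512)   # per-step event counts (activity proxy)
print(adaptive_window_pool(tokens, counts, budget=16).shape)
```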
arXiv Detail & Related papers (2026-02-03T08:06:45Z)
- EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models [56.16721798968254]
We propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we introduce an adaptive token pruning algorithm that leverages the saliency of events as a zero-cost prior to guide spatial reduction.
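A minimal sketch of the coarse, change-triggered sampling idea above: frames are kept only when the event count in the preceding gap signals enough scene change, with a fixed-stride fallback so static stretches are not dropped entirely. The threshold and fallback stride are illustrative assumptions rather than the paper's exact algorithm.

```python
# Sketch: event-guided frame selection (assumed threshold and fallback heuristics).
import numpy as np

def event_guided_sampling(num_frames, events_per_gap, keep_ratio=0.25, min_stride=8):
    # events_per_gap: (num_frames - 1,) event counts between consecutive frames.
    assert len(events_per_gap) == num_frames - 1
    threshold = np.quantile(events_per_gap, 1.0 - keep_ratio)
    keep = [0]
    for i, n in enumerate(events_per_gap, start=1):
        # keep when the gap is "busy" (change-triggered) or too many frames were skipped
        if n >= threshold or i - keep[-1] >= min_stride:
            keep.append(i)
    return keep

gaps = np.random.poisson(20.0, size=299)   # synthetic activity for a 300-frame clip
kept = event_guided_sampling(300, gaps)
print(f"kept {len(kept)} of 300 frames")
```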
arXiv Detail & Related papers (2025-11-24T09:30:02Z)
- Multi-Level LVLM Guidance for Untrimmed Video Action Recognition [0.0]
This paper introduces the Event-Temporalized Video Transformer (ECVT), a novel architecture that bridges the gap between low-level visual features and high-level semantic information. Experiments on ActivityNet v1.3 and THUMOS14 demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14.
arXiv Detail & Related papers (2025-08-24T16:45:21Z)
- EventVL: Understand Event Streams via Multimodal Large Language Model [18.57504605615107]
We propose EventVL, the first generative event-based MLLM framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete the sparse semantic space of events.
arXiv Detail & Related papers (2025-01-23T14:37:21Z)
- EventGPT: Event Stream Understanding with Multimodal Large Language Models [59.65010502000344]
Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. We introduce EventGPT, the first MLLM for event stream understanding.
arXiv Detail & Related papers (2024-12-01T14:38:40Z)
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM [47.786978666537436]
We propose a Two-Stage Prefix-Enhanced MLLM (TSPE) approach for event attribution in movie videos.
In the local stage, we introduce an interaction-aware prefix that guides the model to focus on the relevant multimodal information within a single clip.
In the global stage, we strengthen the connections between associated events using an inferential knowledge graph.
arXiv Detail & Related papers (2024-09-14T08:30:59Z)
- Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding [57.62275091656578]
We refer to a complex event composed of many news articles over an extended period as a Temporal Complex Event (TCE).
This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within TCE.
arXiv Detail & Related papers (2024-06-04T16:42:17Z)
- Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams [19.957857885844838]
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams.
We propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient representation learning on event streams.
Experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
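Since EVSTr consumes event voxel sets, a minimal sketch of how raw (x, y, t, p) events can be binned into a sparse spatio-temporal voxel set is shown below. The grid resolution and the per-polarity count features are assumptions for illustration, not the paper's exact preprocessing.

```python
# Sketch: build a sparse event voxel set from raw events (assumed grid and features).
import numpy as np

def events_to_voxel_set(events, sensor_hw=(260, 346), grid=(8, 16, 16)):
    # events: (N, 4) array with columns [x, y, t, p], polarity p in {0, 1}
    H, W = sensor_hw
    Tb, Hb, Wb = grid
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3].astype(int)
    tn = (t - t.min()) / max(t.max() - t.min(), 1e-9)          # normalize time to [0, 1]
    ix = np.minimum((x / W * Wb).astype(int), Wb - 1)
    iy = np.minimum((y / H * Hb).astype(int), Hb - 1)
    it = np.minimum((tn * Tb).astype(int), Tb - 1)
    feats = np.zeros((Tb, Hb, Wb, 2), dtype=np.float32)         # per-polarity event counts
    np.add.at(feats, (it, iy, ix, p), 1.0)
    coords = np.argwhere(feats.sum(-1) > 0)                     # (M, 3) occupied voxels
    return coords, feats[coords[:, 0], coords[:, 1], coords[:, 2]]

# Synthetic events: 10k random (x, y, t, p) tuples
ev = np.column_stack([np.random.rand(10000) * 346, np.random.rand(10000) * 260,
                      np.sort(np.random.rand(10000)), np.random.randint(0, 2, 10000)])
coords, feats = events_to_voxel_set(ev)
print(coords.shape, feats.shape)
```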
arXiv Detail & Related papers (2023-03-07T12:48:02Z)
- CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and their associated argument roles.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
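For context, below is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that CLIP-style frameworks build on, here pairing image embeddings with embeddings of event-structure text prompts. The prompt construction from information extraction is outside the sketch; batch size, embedding dimension, and temperature are assumptions.

```python
# Sketch: symmetric CLIP-style contrastive loss (generic InfoNCE, assumed shapes).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0))         # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```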
arXiv Detail & Related papers (2022-01-13T17:03:57Z)
- Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task, with a METEOR score of 9.894 on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)