EventFlash: Towards Efficient MLLMs for Event-Based Vision
- URL: http://arxiv.org/abs/2602.03230v1
- Date: Tue, 03 Feb 2026 08:06:45 GMT
- Title: EventFlash: Towards Efficient MLLMs for Event-Based Vision
- Authors: Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Wen Jiang, Ming Li, Xiangyang Ji
- Abstract summary: Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios. We build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets. We present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens. We believe EventFlash serves as an efficient foundation model for event-based vision.
- Score: 55.65520031675231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a $12.4\times$ throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming the 5-bin limit of EventGPT. We believe EventFlash serves as an efficient foundation model for event-based vision.
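The spatial side of the token sparsification described in the abstract (keep informative regions, suppress empty or sparse areas) can be illustrated with a simple event-density heuristic. This is a minimal sketch under assumed names and parameters (`select_dense_patches`, the 16-pixel patch size, the keep ratio), not the paper's actual sparse density-guided attention module:

```python
import numpy as np

def select_dense_patches(event_counts, patch=16, keep_ratio=0.25):
    """Keep only the spatial patches with the most events.

    event_counts : (H, W) array of per-pixel event counts for one window.
    Returns the flat indices of the retained patches, densest first.
    """
    H, W = event_counts.shape
    # Sum event counts inside each non-overlapping patch.
    density = event_counts.reshape(H // patch, patch, W // patch, patch).sum(axis=(1, 3))
    flat = density.ravel()
    k = max(1, int(keep_ratio * flat.size))
    # Indices of the k densest patches; empty or sparse regions are dropped.
    return np.argsort(flat)[::-1][:k]

# Toy example: a 64x64 count map with all events in the top-left patch.
counts = np.zeros((64, 64))
counts[:16, :16] = 1
kept = select_dense_patches(counts, patch=16, keep_ratio=0.25)
print(kept[0])  # → 0 (the top-left patch is the densest)
```

Only the retained patches would then be embedded as visual tokens, which is where the throughput gain over dense image-like processing would come from.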
Related papers
- Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking [51.31378940976401]
Existing RGB-Event tracking approaches fail to fully exploit the unique advantages of event cameras. We propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method.
arXiv Detail & Related papers (2026-01-03T01:10:17Z) - EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models [56.16721798968254]
We propose an event-guided, training-free framework for efficient understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we introduce an adaptive token pruning algorithm that leverages the saliency of events as a zero-cost prior to guide spatial reduction.
arXiv Detail & Related papers (2025-11-24T09:30:02Z) - Learning Efficient Meshflow and Optical Flow from Event Cameras [89.06553762828521]
Event-based meshflow estimation is a novel task that involves predicting a spatially smooth sparse motion field from event cameras. We propose the Efficient Event-based MeshFlow network (EEMFlow), a lightweight model featuring a specially crafted encoder-decoder architecture. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (30x faster) of our EEMFlow model compared to recent state-of-the-art flow methods.
arXiv Detail & Related papers (2025-10-05T09:30:59Z) - Event-based Facial Keypoint Alignment via Cross-Modal Fusion Attention and Self-Supervised Multi-Event Representation Learning [16.170645576584487]
Event cameras offer unique advantages for facial keypoint alignment under challenging conditions. We propose a novel framework based on cross-modal fusion attention (CMFA) and self-supervised multi-event representation learning (SSMER) for event-based facial keypoint alignment.
arXiv Detail & Related papers (2025-09-29T16:00:50Z) - LET-US: Long Event-Text Understanding of Scenes [23.376693904132786]
Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution. We introduce LET-US, a framework for long event stream-text comprehension. We use an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details.
arXiv Detail & Related papers (2025-08-10T16:02:41Z) - EventVL: Understand Event Streams via Multimodal Large Language Model [29.23525787969373]
We propose EventVL, the first generative event-based MLLM framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete the sparse semantic spaces of events.
arXiv Detail & Related papers (2025-01-23T14:37:21Z) - EventGPT: Event Stream Understanding with Multimodal Large Language Models [59.65010502000344]
Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. We introduce EventGPT, the first MLLM for event stream understanding.
arXiv Detail & Related papers (2024-12-01T14:38:40Z) - Rethinking Efficient and Effective Point-based Networks for Event Camera Classification and Regression: EventMamba [11.400397931501338]
Event cameras draw inspiration from biological systems, boasting low latency and high dynamic range while consuming minimal power. Most current approaches to processing Event Clouds involve converting them into frame-based representations. We propose EventMamba, an efficient and effective framework based on the Point Cloud representation.
arXiv Detail & Related papers (2024-05-09T21:47:46Z) - Fast Window-Based Event Denoising with Spatiotemporal Correlation
Enhancement [85.66867277156089]
We propose window-based event denoising, which simultaneously deals with a stack of events.
In the spatial domain, we use maximum a posteriori (MAP) estimation to discriminate real-world events from noise.
Our algorithm can remove event noise effectively and efficiently and improve the performance of downstream tasks.
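The spatiotemporal-correlation idea behind window-based denoising can be illustrated with a crude neighbor-count rule: real events tend to have correlated neighbors in space and time, while noise events are isolated. The paper's MAP formulation is more principled; `denoise_window` and its radius and support parameters are purely illustrative assumptions:

```python
import numpy as np

def denoise_window(events, r_xy=2, r_t=1000, min_support=2):
    """Keep events with enough spatiotemporal neighbors in the same stack.

    events : (N, 3) array of (x, y, t) with t in microseconds.
    An event is kept when at least `min_support` other events fall
    within `r_xy` pixels and `r_t` microseconds of it.
    """
    xy = events[:, :2]
    t = events[:, 2]
    keep = np.zeros(len(events), dtype=bool)
    for i in range(len(events)):
        near_xy = np.all(np.abs(xy - xy[i]) <= r_xy, axis=1)
        near_t = np.abs(t - t[i]) <= r_t
        # Subtract 1 to exclude the event itself from its own support.
        keep[i] = (near_xy & near_t).sum() - 1 >= min_support
    return events[keep]

# Three correlated events plus one isolated noise event.
ev = np.array([[10, 10, 0], [11, 10, 200], [10, 11, 400], [90, 90, 300]])
print(len(denoise_window(ev)))  # → 3
```

Processing a whole window (stack) of events at once, rather than one event at a time, is what makes this style of denoising fast in practice.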
arXiv Detail & Related papers (2024-02-14T15:56:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.