EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
- URL: http://arxiv.org/abs/2511.18920v1
- Date: Mon, 24 Nov 2025 09:30:02 GMT
- Title: EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
- Authors: Wenhao Xu, Xin Dong, Yue Li, Haoyuan Shi, Zhiwei Xiong
- Abstract summary: We propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction.
- Score: 56.16721798968254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
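The abstract describes the method only in prose, so the following is a minimal sketch of the temporal half under stated assumptions: per-frame event activity is approximated from grayscale frame differences (a cheap stand-in for real or simulated events), and a coarse-to-fine pass keeps the most change-heavy frame in each temporal chunk. All function names, thresholds, and shapes are illustrative and are not EventSTU's implementation.

```python
import numpy as np

def simulate_event_counts(frames: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Approximate per-frame event activity from grayscale frame differences.
    frames: (T, H, W) array in [0, 1]. Real event cameras trigger on log-intensity
    changes; absolute frame differences are only a rough proxy here."""
    diffs = np.abs(np.diff(frames, axis=0))           # (T-1, H, W) frame-to-frame change
    counts = (diffs > threshold).sum(axis=(1, 2))     # "events" per frame transition
    return np.concatenate([[0], counts])              # align counts with frame indices

def coarse_to_fine_keyframes(event_counts: np.ndarray, budget: int) -> np.ndarray:
    """Coarse stage: split the video into `budget` equal temporal chunks.
    Fine stage: within each chunk, keep the frame with the most event activity,
    so static (redundant) frames are skipped."""
    chunks = np.array_split(np.arange(len(event_counts)), budget)
    picks = [chunk[np.argmax(event_counts[chunk])] for chunk in chunks if len(chunk)]
    return np.array(sorted(set(int(i) for i in picks)))

# Toy usage: 120 synthetic frames, keep 8 keyframes.
video = np.random.rand(120, 64, 64)
keyframes = coarse_to_fine_keyframes(simulate_event_counts(video), budget=8)
print(keyframes)
```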
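The spatial half can be sketched in the same hedged way: an event-density map acts as a zero-cost saliency prior over the visual-token grid, and a question-relevance score (which the paper derives from keyframe sampling; here it is just a number in [0, 1]) scales each frame's token budget. The names and keep ratios below are assumptions for illustration only.

```python
import numpy as np

def event_saliency_map(events_xy: np.ndarray, hw: tuple, grid: int = 24) -> np.ndarray:
    """Bin (x, y) event locations of one keyframe into a grid x grid density map,
    one cell per visual token, normalized to [0, 1]."""
    h, w = hw
    ys = np.clip((events_xy[:, 1] / h * grid).astype(int), 0, grid - 1)
    xs = np.clip((events_xy[:, 0] / w * grid).astype(int), 0, grid - 1)
    density = np.zeros((grid, grid))
    np.add.at(density, (ys, xs), 1.0)                 # accumulate events per token cell
    return density / max(density.max(), 1e-6)

def prune_tokens(saliency: np.ndarray, relevance: float,
                 base_keep: float = 0.25, max_keep: float = 0.75) -> np.ndarray:
    """Keep the most event-salient tokens; frames judged more relevant to the
    question receive a larger budget. Returns flat indices of the kept tokens."""
    keep = base_keep + (max_keep - base_keep) * float(np.clip(relevance, 0.0, 1.0))
    k = max(1, int(round(saliency.size * keep)))
    return np.argsort(saliency.ravel())[::-1][:k]

# Toy usage: 5000 random events on a 336x336 keyframe, question relevance 0.8.
events = np.random.rand(5000, 2) * 336
kept = prune_tokens(event_saliency_map(events, hw=(336, 336)), relevance=0.8)
print(f"kept {len(kept)} of {24 * 24} tokens")
```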
Related papers
- E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching [87.38371267983263]
Temporal Video Grounding aims to precisely localize time segments corresponding to query events.
E.M.Ground is a novel Vid-LLM for TVG that focuses on holistic and coherent event perception.
E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
arXiv Detail & Related papers (2026-02-05T02:16:00Z) - EventFlash: Towards Efficient MLLMs for Event-Based Vision [55.65520031675231]
Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios.
We build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets.
We present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens.
We believe EventFlash serves as an efficient foundation model for event-based vision.
arXiv Detail & Related papers (2026-02-03T08:06:45Z) - LET-US: Long Event-Text Understanding of Scenes [23.376693904132786]
Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution.
We introduce LET-US, a framework for long event-stream-text comprehension.
We use an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details.
arXiv Detail & Related papers (2025-08-10T16:02:41Z) - TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action [28.930109403769166]
We propose TEMPURA, a two-stage training framework that enhances video temporal understanding.
TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations.
We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps.
arXiv Detail & Related papers (2025-05-02T21:00:17Z) - EventVL: Understand Event Streams via Multimodal Large Language Model [29.23525787969373]
We propose EventVL, the first generative event-based MLLM framework for explicit semantic understanding.
Specifically, to bridge the data gap for connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset.
To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete sparse semantic spaces of events.
arXiv Detail & Related papers (2025-01-23T14:37:21Z) - EventGPT: Event Stream Understanding with Multimodal Large Language Models [59.65010502000344]
Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions.
Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better.
We introduce EventGPT, the first MLLM for event stream understanding.
arXiv Detail & Related papers (2024-12-01T14:38:40Z) - PASS: Path-selective State Space Model for Event-based Recognition [12.651829415097758]
Event cameras are bio-inspired sensors with advantages such as high temporal resolution.
We present our PASS framework, exhibiting superior capacity for high event modeling.
Our key insight is to learn adaptively encoded event features via state space models.
arXiv Detail & Related papers (2024-09-25T14:08:37Z) - EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves powerful video-text retrieval ability through superior video event awareness.
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment.
arXiv Detail & Related papers (2024-07-10T09:09:58Z) - Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z) - Exploring Event-based Human Pose Estimation with 3D Event Representations [26.34100847541989]
We introduce two 3D event representations: the Rasterized Event Point Cloud (Ras EPC) and the Decoupled Event Voxel (DEV).
The Ras EPC aggregates events within concise temporal slices at identical positions, preserving their 3D attributes along with statistical information, thereby significantly reducing memory and computational demands.
Our methods are tested on the DHP19 public dataset, MMHPSD dataset, and our EV-3DPW dataset, with further qualitative validation via a derived driving scene dataset EV-JAAD and an outdoor collection vehicle.
arXiv Detail & Related papers (2023-11-08T10:45:09Z) - Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they can be subject to two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z) - Event Transformer [43.193463048148374]
An event camera's low power consumption and ability to capture microsecond-level brightness changes make it attractive for various computer vision tasks.
Existing event representation methods typically convert events into frames, voxel grids, or spikes for deep neural networks (DNNs).
This work introduces a novel token-based event representation, where each event is considered a fundamental processing unit termed an event-token.
arXiv Detail & Related papers (2022-04-11T15:05:06Z)