E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching
- URL: http://arxiv.org/abs/2602.05215v1
- Date: Thu, 05 Feb 2026 02:16:00 GMT
- Title: E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching
- Authors: Jiahao Nie, Wenbin An, Gongjie Zhang, Yicheng Xu, Yap-Peng Tan, Alex C. Kot, Shijian Lu
- Abstract summary: Temporal Video Grounding aims to precisely localize time segments corresponding to query events. E.M.Ground is a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
- Score: 87.38371267983263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special <event> token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
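To make step (ii) concrete, here is a minimal sketch of smoothing the token-to-frame similarity curve with a Savitzky-Golay filter and reading off a segment. The window length, polynomial order, and relative-threshold rule are illustrative assumptions, not values from the paper; in E.M.Ground the similarities come from matching the <event> token against per-frame features, while here they are simply a NumPy array.

```python
import numpy as np
from scipy.signal import savgol_filter

def localize_event(similarities, window=9, polyorder=2, threshold=0.5):
    """Smooth <event>-token-to-frame similarities, then read off a segment.

    similarities: 1-D array of per-frame similarity scores, len >= window.
    window (odd), polyorder, and threshold are illustrative values only.
    """
    s = np.asarray(similarities, dtype=np.float64)
    # Savitzky-Golay smoothing suppresses per-frame noise while keeping
    # the overall shape of the similarity curve.
    smoothed = savgol_filter(s, window_length=window, polyorder=polyorder)
    # Min-max normalize, then keep frames above a relative threshold.
    lo, hi = smoothed.min(), smoothed.max()
    norm = (smoothed - lo) / (hi - lo + 1e-8)
    active = np.flatnonzero(norm >= threshold)
    if active.size == 0:                    # degenerate case: flat curve
        peak = int(np.argmax(norm))
        return peak, peak
    return int(active[0]), int(active[-1])  # (start_frame, end_frame)
```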
Related papers
- EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models [56.16721798968254]
We propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the saliency of events as a zero-cost prior to guide spatial token reduction.
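For intuition, a minimal sketch of event-guided frame selection, assuming per-frame event counts are available; the uniform-coverage stage and the 50/50 budget split are our assumptions, not EventSTU's actual algorithm.

```python
import numpy as np

def coarse_to_fine_sample(event_counts, budget):
    """Pick `budget` frame indices from per-frame event-camera activity.

    Coarse stage: uniform coverage of the whole clip.  Fine stage: spend
    the remaining budget on the most event-active (most-changing) frames.
    The 50/50 split between stages is an illustrative assumption.
    """
    counts = np.asarray(event_counts, dtype=np.float64)
    n = len(counts)
    coarse = budget // 2
    picked = set(np.linspace(0, n - 1, coarse).astype(int).tolist())
    # Fine stage: rank remaining frames by event activity, since few
    # events between frames indicates visual redundancy.
    for t in np.argsort(counts)[::-1]:
        if len(picked) >= budget:
            break
        picked.add(int(t))
    return sorted(picked)
```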
arXiv Detail & Related papers (2025-11-24T09:30:02Z)
- Dense Video Understanding with Gated Residual Tokenization [49.17263029080152]
High temporal resolution is essential for capturing fine-grained details in video understanding. Current benchmarks rely mostly on low-frame-rate sampling. Dense Video Understanding (DVU) enables high-FPS video comprehension by reducing both tokenization time and token overhead.
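A rough sketch of the residual-tokenization intuition, under our own assumptions (grayscale frames, a mean-absolute-residual gate, fixed patch size); the paper's actual gating mechanism may differ, so treat this only as an illustration of why unchanged patches can be skipped at high FPS.

```python
import numpy as np

def gated_residual_patches(frames, patch=16, gate=0.05):
    """Select which patches to (re-)tokenize, gating on frame residuals.

    frames: (T, H, W) array in [0, 1]; H and W divisible by `patch`.
    Returns (t, row, col) patch indices; the patch size, gate threshold,
    and mean-absolute-residual rule are illustrative assumptions.
    """
    T, H, W = frames.shape
    keep = []
    for t in range(T):
        for r in range(0, H, patch):
            for c in range(0, W, patch):
                if t == 0:
                    keep.append((t, r // patch, c // patch))  # full first frame
                    continue
                # Gate: mean absolute residual against the previous frame.
                resid = np.abs(frames[t, r:r + patch, c:c + patch]
                               - frames[t - 1, r:r + patch, c:c + patch]).mean()
                if resid > gate:
                    keep.append((t, r // patch, c // patch))
    return keep
```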
arXiv Detail & Related papers (2025-09-17T17:34:40Z)
- DATE: Dynamic Absolute Time Enhancement for Long Video Understanding [8.720269393713451]
Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs). We propose Dynamic Absolute Time Enhancement (DATE), which enhances temporal awareness in MLLMs. We introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage.
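To illustrate what absolute-time enhancement could look like at the input level, a small sketch that interleaves timestamp markers with frame tokens; the marker format and placement are our assumptions, not DATE's actual encoding.

```python
def interleave_absolute_time(frame_tokens, fps):
    """Prepend an absolute-time marker to each frame's token list.

    frame_tokens: list of per-frame token lists; fps: sampling rate.
    The "<t=...s>" marker format is an illustrative choice only.
    """
    sequence = []
    for i, tokens in enumerate(frame_tokens):
        sequence.append(f"<t={i / fps:.1f}s>")  # absolute timestamp token
        sequence.extend(tokens)
    return sequence
```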
arXiv Detail & Related papers (2025-09-11T08:49:22Z)
- Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
arXiv Detail & Related papers (2025-08-27T14:33:32Z)
- LET-US: Long Event-Text Understanding of Scenes [23.376693904132786]
Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution. We introduce LET-US, a framework for long event-stream–text comprehension. We use an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details.
arXiv Detail & Related papers (2025-08-10T16:02:41Z)
- PASS: Path-selective State Space Model for Event-based Recognition [12.651829415097758]
Event cameras are bio-inspired sensors with advantages such as high temporal resolution. We present our PASS framework, exhibiting superior capacity for event modeling. Our key insight is to learn adaptively encoded event features via state space models.
arXiv Detail & Related papers (2024-09-25T14:08:37Z)
- GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval [60.70901959953688]
We present GMMFormer v2, an uncertainty-aware framework for PRVR.
For clip modeling, we improve the strong GMMFormer baseline with a novel temporal consolidation module.
We propose a novel optimal matching loss for fine-grained text-clip alignment.
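For intuition, a generic optimal-matching alignment loss built on Hungarian assignment (SciPy's linear_sum_assignment); this is a plain sketch of one-to-one text-clip matching, not GMMFormer v2's actual loss.

```python
import torch
from scipy.optimize import linear_sum_assignment

def optimal_matching_loss(text_emb, clip_emb):
    """A generic optimal-matching alignment loss (illustrative only).

    text_emb: (N, D) text embeddings; clip_emb: (M, D) clip embeddings.
    Solves a one-to-one assignment maximizing cosine similarity, then
    pulls the matched text-clip pairs together.
    """
    text = torch.nn.functional.normalize(text_emb, dim=-1)
    clip = torch.nn.functional.normalize(clip_emb, dim=-1)
    sim = text @ clip.T                      # (N, M) cosine similarities
    # Hungarian matching on negated similarities = max-similarity assignment.
    rows, cols = linear_sum_assignment(-sim.detach().cpu().numpy())
    matched = sim[torch.as_tensor(rows, device=sim.device),
                  torch.as_tensor(cols, device=sim.device)]
    return (1.0 - matched).mean()            # lower when matched pairs align
```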
arXiv Detail & Related papers (2024-05-22T16:55:31Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
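As a sketch of the equivariance intuition: shifting the video in time should shift the predicted boundaries by the same amount. The model interface, roll-based shift, and L1 penalty below are all our assumptions, not the paper's formulation.

```python
import torch

def shift_consistency_loss(model, video, query, max_shift=8):
    """Illustrative equivariance check for temporal grounding.

    Assumes `model(video, query)` returns (start, end) in frame units
    and `video` is laid out as (T, ...).  A temporal shift by k frames
    should shift the predicted boundaries by the same k.
    """
    k = int(torch.randint(1, max_shift + 1, (1,)))
    start, end = model(video, query)
    s_start, s_end = model(torch.roll(video, shifts=k, dims=0), query)
    # Equivariance residual: un-shift the new prediction and compare.
    return torch.abs((s_start - k) - start) + torch.abs((s_end - k) - end)
```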
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)