Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
- URL: http://arxiv.org/abs/2508.04546v1
- Date: Wed, 06 Aug 2025 15:33:49 GMT
- Title: Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
- Authors: Minghang Zheng, Yuxin Peng, Benyuan Sun, Yi Yang, Yang Liu
- Abstract summary: We tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations.
- Score: 49.51013055630857
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. The existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable the real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and further regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.
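The abstract describes a memory that keeps recent events at fine granularity while compressing older events so long-term history remains accessible with bounded storage. Below is a minimal sketch of how such a hierarchical event memory might be organized for a streaming video; the class and field names, per-level capacities, and the mean-pooling merge rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a hierarchical event memory for streaming video.
# Level capacities and the mean-pooling merge rule are assumptions.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class EventProposal:
    start: float          # event start time (seconds)
    end: float            # event end time (seconds)
    feature: np.ndarray   # pooled event-level feature


class HierarchicalEventMemory:
    """Keeps recent events at fine granularity and older events in
    progressively coarser levels, so both short- and long-term history
    stay accessible with bounded memory."""

    def __init__(self, num_levels: int = 3, capacity_per_level: int = 8):
        self.levels: List[List[EventProposal]] = [[] for _ in range(num_levels)]
        self.capacity = capacity_per_level

    def write(self, event: EventProposal, level: int = 0) -> None:
        self.levels[level].append(event)
        # When a level overflows, merge its two oldest events into one coarser
        # event and promote it to the next level (dropped at the top level).
        if len(self.levels[level]) > self.capacity:
            a, b = self.levels[level].pop(0), self.levels[level].pop(0)
            merged = EventProposal(
                start=a.start,
                end=b.end,
                feature=(a.feature + b.feature) / 2.0,  # assumed merge rule
            )
            if level + 1 < len(self.levels):
                self.write(merged, level + 1)

    def read(self) -> List[EventProposal]:
        # Return all stored events: long-term (coarse) first, recent (fine) last.
        return [e for level in reversed(self.levels) for e in level]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = HierarchicalEventMemory()
    # Simulate a stream of 2-second event proposals produced online.
    for t in range(40):
        memory.write(EventProposal(start=2.0 * t, end=2.0 * t + 2.0,
                                   feature=rng.standard_normal(256)))
    print(sum(len(level) for level in memory.levels), "events retained across levels")
```

In this sketch, a new event proposal is always written at the finest level; overflow triggers merging, so the number of stored events stays bounded no matter how long the stream runs, which is the property the paper's memory is designed around.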
Related papers
- VidEvent: A Large Dataset for Understanding Dynamic Evolution of Events in Videos [6.442765801124304]
We propose the task of video event understanding that extracts event scripts and makes predictions with these scripts from videos. To support this task, we introduce VidEvent, a large-scale dataset containing over 23,000 well-labeled events. The dataset was created through a meticulous annotation process, ensuring high-quality and reliable event data.
arXiv Detail & Related papers (2025-06-03T05:12:48Z) - Event Stream-based Visual Object Tracking: HDETrack V2 and A High-Definition Benchmark [36.9654606035663]
We introduce a novel hierarchical knowledge distillation strategy to guide the learning of the student Transformer network. We adapt the network model to specific target objects during testing via a newly proposed test-time tuning strategy. We propose EventVOT, the first large-scale high-resolution event-based tracking dataset.
arXiv Detail & Related papers (2025-02-08T13:59:52Z) - EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior video event awareness.
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of detailed event content and complex event temporal relations.
arXiv Detail & Related papers (2024-07-10T09:09:58Z) - Exploring Event-based Human Pose Estimation with 3D Event Representations [26.34100847541989]
We introduce two 3D event representations: the Rasterized Event Point Cloud (Ras EPC) and the Decoupled Event Voxel (DEV)
The Ras EPC aggregates events within concise temporal slices at identical positions, preserving their 3D attributes along with statistical information, thereby significantly reducing memory and computational demands.
Our methods are tested on the DHP19 public dataset, MMHPSD dataset, and our EV-3DPW dataset, with further qualitative validation via a derived driving scene dataset EV-JAAD and an outdoor collection vehicle.
arXiv Detail & Related papers (2023-11-08T10:45:09Z) - Exploring the Limits of Historical Information for Temporal Knowledge Graph Extrapolation [59.417443739208146]
We propose a new event forecasting model based on a novel training framework of historical contrastive learning.
CENET learns both historical and non-historical dependencies to distinguish the most likely entities.
We evaluate our proposed model on five benchmark graphs.
arXiv Detail & Related papers (2023-08-29T03:26:38Z) - Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they can be subject to two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z) - Unifying Event Detection and Captioning as Sequence Generation via Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning.
Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z) - A Graph Enhanced BERT Model for Event Prediction [35.02248467245135]
We consider automatically building an event graph using a BERT model.
We incorporate an additional structured variable into BERT to learn to predict the event connections in the training process.
Results on two event prediction tasks, script event prediction and story ending prediction, show that our approach can outperform state-of-the-art baseline methods.
arXiv Detail & Related papers (2022-05-22T13:37:38Z) - Meta-Reinforcement Learning via Buffering Graph Signatures for Live Video Streaming Events [4.332367445046418]
We present a meta-learning model to adapt the predictions of the network's capacity between viewers who participate in a live video streaming event.
We evaluate the proposed model on the link weight prediction task on three real-world datasets of live video streaming events.
arXiv Detail & Related papers (2021-10-03T14:03:22Z) - Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)