Related papers: Structured Context Learning for Generic Event Boundary Detection

Structured Context Learning for Generic Event Boundary Detection

URL: http://arxiv.org/abs/2512.00475v1
Date: Sat, 29 Nov 2025 13:06:52 GMT
Title: Structured Context Learning for Generic Event Boundary Detection
Authors: Xin Gu, Congcong Li, Xinyao Wang, Dexiang Hong, Libo Zhang, Tiejian Luo, Longyin Wen, Heng Fan,
Abstract summary: Generic Event Boundary Detection aims to identify moments in videos that humans perceive as event boundaries.<n>This paper proposes a novel method for addressing this task, called Structured Context Learning.<n>Our approach is end-to-end trainable and flexible, not restricted to specific temporal models like GRU, LSTM, and Transformers.
Score: 34.30144454487081
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Generic Event Boundary Detection (GEBD) aims to identify moments in videos that humans perceive as event boundaries. This paper proposes a novel method for addressing this task, called Structured Context Learning, which introduces the Structured Partition of Sequence (SPoS) to provide a structured context for learning temporal information. Our approach is end-to-end trainable and flexible, not restricted to specific temporal models like GRU, LSTM, and Transformers. This flexibility enables our method to achieve a better speed-accuracy trade-off. Specifically, we apply SPoS to partition the input frame sequence and provide a structured context for the subsequent temporal model. Notably, SPoS's overall computational complexity is linear with respect to the video length. We next calculate group similarities to capture differences between frames, and a lightweight fully convolutional network is utilized to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, we adapt the Gaussian kernel to preprocess the ground-truth event boundaries. Our proposed method has been extensively evaluated on the challenging Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating its superiority over existing state-of-the-art methods.

Related papers

Local Compressed Video Stream Learning for Generic Event Boundary Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before feeding into the network. We propose a novel event boundary detection method that is fully end-to-end leveraging rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z)
Learning Sequence Descriptor based on Spatio-Temporal Attention for Visual Place Recognition [16.380948630155476]
Visual Place Recognition (VPR) aims to retrieve frames from atagged database that are located at the same place as the query frame. To improve the robustness of VPR in geoly aliasing scenarios, sequence-based VPR methods are proposed. We use a sliding window to control the temporal range of attention and use relative positional encoding to construct sequential relationships between different features.
arXiv Detail & Related papers (2023-05-19T06:39:10Z)
Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video. Our motivation comes from that the temporal boundary of the query-guided activity should be consistently predicted. In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
Video Activity Localisation with Uncertainties in Temporal Boundary [74.7263952414899]
Methods for video activity localisation over time assume implicitly that activity temporal boundaries are determined and precise. In unscripted natural videos, different activities transit smoothly, so that it is intrinsically ambiguous to determine in labelling precisely when an activity starts and ends over time. We introduce Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries.
arXiv Detail & Related papers (2022-06-26T16:45:56Z)
Structured Context Transformer for Generic Event Boundary Detection [32.09242716244653]
We present Structured Context Transformer (or SC-Transformer) to solve the Generic Event Boundary Detection task. We use the backbone convolutional neural network (CNN) to extract the features of each video frame. A lightweight fully convolutional network is used to determine the event boundaries based on the grouped similarity maps.
arXiv Detail & Related papers (2022-06-07T03:00:24Z)
End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection [31.31508043234419]
We propose a new end-to-end compressed video representation learning for event boundary detection. We first use the ConvNets to extract features of the I-frames in the GOPs. After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames. A temporal contrastive module is proposed to determine the event boundaries of video sequences.
arXiv Detail & Related papers (2022-03-29T08:27:48Z)
Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection [48.33132632418303]
Generic Boundary Detection (GBD) aims at locating general boundaries that divide videos into semantically coherent and taxonomy-free units. Previous research separately handle these different-level generic boundaries with specific designs of complicated deep networks from simple CNN to LSTM. We present Temporal Perceiver, a general architecture with Transformers, offering a unified solution to the detection of arbitrary generic boundaries.
arXiv Detail & Related papers (2022-03-01T09:31:30Z)
Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field. We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network. An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
Temporally-Consistent Surface Reconstruction using Metrically-Consistent Atlases [131.50372468579067]
We propose a method for unsupervised reconstruction of a temporally-consistent sequence of surfaces from a sequence of time-evolving point clouds. We represent the reconstructed surfaces as atlases computed by a neural network, which enables us to establish correspondences between frames. Our approach outperforms state-of-the-art ones on several challenging datasets.
arXiv Detail & Related papers (2021-11-12T17:48:25Z)
Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in semi-supervised setting. We propose a novel graph neuralS network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.