Structured Context Transformer for Generic Event Boundary Detection
- URL: http://arxiv.org/abs/2206.02985v1
- Date: Tue, 7 Jun 2022 03:00:24 GMT
- Title: Structured Context Transformer for Generic Event Boundary Detection
- Authors: Congcong Li, Xinyao Wang, Dexiang Hong, Yufei Wang, Libo Zhang,
Tiejian Luo, Longyin Wen
- Abstract summary: We present the Structured Context Transformer (or SC-Transformer) to solve the Generic Event Boundary Detection (GEBD) task.
A backbone convolutional neural network (CNN) is used to extract features from each video frame.
A lightweight fully convolutional network then determines the event boundaries based on the group similarity maps.
- Score: 32.09242716244653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generic Event Boundary Detection (GEBD) aims to detect the moments that humans
naturally perceive as event boundaries. In this paper, we present the Structured
Context Transformer (or SC-Transformer) to solve the GEBD task, which can be
trained in an end-to-end fashion. Specifically, we use a backbone
convolutional neural network (CNN) to extract features from each video frame.
To capture the temporal context of each frame, we design the structured
context transformer (SC-Transformer) by re-partitioning the input frame sequence.
Notably, the overall computational complexity of SC-Transformer is linear in
the video length. Group similarities are then computed to capture the
differences between frames, and a lightweight fully convolutional network
determines the event boundaries based on the group similarity maps.
To remedy the ambiguity of boundary annotations, a Gaussian kernel is
used to smooth the ground-truth event boundaries, which further boosts
accuracy. Extensive experiments conducted on the challenging Kinetics-GEBD and
TAPOS datasets demonstrate the effectiveness of the proposed method against
state-of-the-art methods.
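As a rough illustration of two of the steps above, the following PyTorch sketch (a) smooths hard 0/1 boundary annotations with a Gaussian kernel and (b) splits per-frame features into channel groups and computes group-wise similarity maps between frames. The group count, kernel width, and tensor shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def smooth_boundary_labels(hard_labels: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Convolve hard 0/1 boundary annotations (shape (T,)) with a Gaussian
    kernel, turning each annotated boundary into a soft, bell-shaped target."""
    radius = int(3 * sigma)
    xs = torch.arange(-radius, radius + 1, dtype=torch.float32)
    kernel = torch.exp(-xs ** 2 / (2 * sigma ** 2))
    kernel = (kernel / kernel.sum()).view(1, 1, -1)
    soft = F.conv1d(hard_labels.view(1, 1, -1).float(), kernel, padding=radius)
    return soft.view(-1).clamp(max=1.0)

def group_similarity(feats: torch.Tensor, num_groups: int = 4) -> torch.Tensor:
    """feats: (T, C) per-frame features. Split channels into groups and build
    a (num_groups, T, T) stack of cosine-similarity maps, so a downstream
    network can read frame-to-frame differences per channel group."""
    T, C = feats.shape
    g = feats.view(T, num_groups, C // num_groups).permute(1, 0, 2)  # (G, T, C/G)
    g = F.normalize(g, dim=-1)
    return torch.einsum('gtc,gsc->gts', g, g)  # (G, T, T)

# Example: 16 frames, annotated boundaries at frames 5 and 11.
labels = torch.zeros(16)
labels[5] = labels[11] = 1.0
print(smooth_boundary_labels(labels, sigma=1.0))
print(group_similarity(torch.randn(16, 64)).shape)  # torch.Size([4, 16, 16])
```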
Related papers
- CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks [10.622511683372815]
We present CT-Bound, a robust and fast boundary detection method for very noisy images using a hybrid Convolution and Transformer neural network.
During local detection, the model uses a convolutional architecture to predict the boundary structure of each image patch.
It then uses a feed-forward transformer architecture to globally refine the boundary structure of each patch, generating an edge map and a smoothed color map simultaneously.
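A minimal sketch of the two-stage idea described above, assuming a simple per-patch CNN followed by a transformer encoder over patch tokens; the layer sizes, patch size, and the LocalGlobalBoundaryNet name are illustrative placeholders, not CT-Bound's actual architecture.

```python
import torch
import torch.nn as nn

class LocalGlobalBoundaryNet(nn.Module):
    """Illustrative two-stage design: a small CNN scores boundary structure
    inside each image patch, then a transformer encoder lets patches exchange
    information to refine those local estimates globally."""
    def __init__(self, patch: int = 16, dim: int = 128):
        super().__init__()
        self.local = nn.Sequential(  # per-patch boundary descriptor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, patch, stride=patch),  # one token per patch
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.refine = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, patch * patch)  # per-pixel edge logits

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        tokens = self.local(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.refine(tokens)                          # global refinement
        return self.head(tokens)   # (B, N, patch*patch) edge logits per patch

edges = LocalGlobalBoundaryNet()(torch.randn(1, 3, 64, 64))
print(edges.shape)  # torch.Size([1, 16, 256])
```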
arXiv Detail & Related papers (2024-03-25T07:22:22Z)
- Local Compressed Video Stream Learning for Generic Event Boundary Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be fully decoded before being fed into the network.
We propose a novel event boundary detection method that is fully end-to-end and leverages the rich information available in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z)
- Generic Event Boundary Detection in Video with Pyramid Features [12.896848011230523]
Generic event boundary detection (GEBD) aims to split a video into chunks at a broad and diverse set of actions, where humans naturally perceive event boundaries.
We present an approach that considers the correlation between neighbor frames with pyramid feature maps in both spatial and temporal dimensions.
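A minimal sketch of one plausible reading of this idea: build a spatial pyramid by pooling per-frame feature maps and, at each level, record the cosine similarity between neighboring frames, where a similarity drop hints at an event boundary. The pooling scheme and level count are assumptions, not the paper's design.

```python
import torch
import torch.nn.functional as F

def neighbor_correlation_pyramid(feats: torch.Tensor, levels: int = 3) -> torch.Tensor:
    """feats: (T, C, H, W) per-frame feature maps. For each pyramid level,
    downsample spatially and record the cosine similarity between every frame
    and its next neighbor; low similarity suggests a boundary candidate."""
    sims = []
    for lv in range(levels):
        f = F.avg_pool2d(feats, 2 ** lv) if lv > 0 else feats
        v = F.normalize(f.flatten(1), dim=1)      # (T, C*H*W) at this level
        sims.append((v[:-1] * v[1:]).sum(dim=1))  # (T-1,) neighbor similarity
    return torch.stack(sims)                      # (levels, T-1)

print(neighbor_correlation_pyramid(torch.randn(16, 64, 8, 8)).shape)  # (3, 15)
```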
arXiv Detail & Related papers (2023-01-11T03:29:27Z)
- End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection [31.31508043234419]
We propose a new end-to-end compressed video representation learning method for event boundary detection.
We first use ConvNets to extract features of the I-frames in the GOPs (groups of pictures).
After that, a lightweight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames.
A temporal contrastive module is proposed to determine the event boundaries of video sequences.
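A hedged sketch of the described pipeline under simplifying assumptions: a heavier ConvNet embeds each GOP's I-frame, a lightweight encoder embeds the P-frames' motion vectors conditioned on it, and a simple neighbor-similarity contrast stands in for the temporal contrastive module. All module shapes and the conditioning scheme are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedGOPEncoder(nn.Module):
    """Illustrative compressed-domain encoder: embed the I-frame with a
    heavier ConvNet, embed P-frame motion vectors with a lightweight one,
    and score boundaries by contrasting neighboring frame features."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.i_net = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4, padding=3),
                                   nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.p_net = nn.Sequential(nn.Conv2d(2, dim, 3, stride=2, padding=1),
                                   nn.ReLU(), nn.AdaptiveAvgPool2d(1))

    def forward(self, i_frame, p_motion):
        # i_frame: (1, 3, H, W) decoded I-frame; p_motion: (P, 2, H, W) motion vectors
        i_feat = self.i_net(i_frame).flatten(1)            # (1, dim)
        p_feat = self.p_net(p_motion).flatten(1) + i_feat  # (P, dim), conditioned on I
        feats = torch.cat([i_feat, p_feat])                # (1+P, dim) GOP features
        sim = F.cosine_similarity(feats[:-1], feats[1:])   # neighbor similarity
        return feats, 1.0 - sim                            # low similarity => boundary

enc = CompressedGOPEncoder()
feats, score = enc(torch.randn(1, 3, 64, 64), torch.randn(11, 2, 64, 64))
print(feats.shape, score.shape)  # torch.Size([12, 128]) torch.Size([11])
```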
arXiv Detail & Related papers (2022-03-29T08:27:48Z)
- Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection [48.33132632418303]
Generic Boundary Detection (GBD) aims at locating general boundaries that divide videos into semantically coherent and taxonomy-free units.
Previous research handles these different levels of generic boundaries separately, with specifically designed, complicated deep networks ranging from simple CNNs to LSTMs.
We present Temporal Perceiver, a general architecture with Transformers, offering a unified solution to the detection of arbitrary generic boundaries.
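The summary does not spell out the mechanism, but the Perceiver family compresses variable-length inputs by letting a small set of learned latent queries cross-attend to the sequence; here is a minimal sketch of that idea, with sizes chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """A fixed, small set of learned latent queries cross-attends to the
    frame sequence, compressing arbitrarily long inputs into a constant
    number of tokens (Perceiver-style); sizes are illustrative."""
    def __init__(self, num_latents: int = 16, dim: int = 128):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, dim) frame features -> (B, num_latents, dim) tokens
        q = self.latents.unsqueeze(0).expand(frames.size(0), -1, -1)
        out, _ = self.cross(q, frames, frames)  # latents attend to all frames
        return out

print(LatentCrossAttention()(torch.randn(2, 100, 128)).shape)  # (2, 16, 128)
```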
arXiv Detail & Related papers (2022-03-01T09:31:30Z)
- Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
Segmentation-based methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
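The core trick of Differentiable Binarization is to replace the hard binarization step B = (P > T) with a steep sigmoid so the threshold map can be learned end-to-end with the segmentation network; a minimal sketch follows (k = 50 is the amplifying factor reported in the DB paper).

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    """Approximate the hard step B = (P > T) with the steep sigmoid
    B_hat = 1 / (1 + exp(-k * (P - T))), keeping binarization
    differentiable so P and T can be trained jointly."""
    return torch.sigmoid(k * (prob_map - thresh_map))

P = torch.tensor([[0.2, 0.6], [0.9, 0.4]])  # per-pixel text probability
T = torch.tensor([[0.5, 0.5], [0.5, 0.5]])  # learned per-pixel threshold
print(differentiable_binarization(P, T))    # near-0 / near-1 binarized map
```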
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies via self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
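A simplified stand-in for local attention: restrict self-attention to non-overlapping windows of tokens, cutting the cost from O(N^2) to O(N x window). The window size and dimensions here are illustrative, and real video models attend over space-time windows rather than a flat token sequence.

```python
import torch
import torch.nn as nn

def windowed_self_attention(tokens: torch.Tensor, window: int,
                            attn: nn.MultiheadAttention) -> torch.Tensor:
    """Fold non-overlapping windows of `window` tokens into the batch
    dimension so attention is computed only within each window."""
    B, N, C = tokens.shape
    w = tokens.view(B * N // window, window, C)  # (B*N/window, window, C)
    out, _ = attn(w, w, w)                       # attention within each window
    return out.view(B, N, C)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out = windowed_self_attention(torch.randn(2, 32, 64), window=8, attn=attn)
print(out.shape)  # torch.Size([2, 32, 64])
```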
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
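A minimal sketch of the patch-token entry point such a pure-transformer model relies on: split the image into fixed-size patches, embed each as a token, and let a transformer encoder propagate global context among them. Patch and embedding sizes are illustrative.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Split an image into fixed-size patches and linearly embed each one
    as a token, the standard input stage of a pure-transformer vision model."""
    def __init__(self, patch: int = 16, dim: int = 256):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        return self.embed(img).flatten(2).transpose(1, 2)  # (B, num_patches, dim)

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
tokens = PatchTokenizer()(torch.randn(1, 3, 224, 224))   # (1, 196, 256)
global_tokens = nn.TransformerEncoder(layer, 1)(tokens)  # global context mixing
print(global_tokens.shape)  # torch.Size([1, 196, 256])
```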
arXiv Detail & Related papers (2021-04-25T08:24:06Z)