Local Compressed Video Stream Learning for Generic Event Boundary
Detection
- URL: http://arxiv.org/abs/2309.15431v1
- Date: Wed, 27 Sep 2023 06:49:40 GMT
- Title: Local Compressed Video Stream Learning for Generic Event Boundary
Detection
- Authors: Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan
- Abstract summary: Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before being fed into the network.
We propose a novel event boundary detection method that is fully end-to-end, leveraging rich information in the compressed domain.
- Score: 25.37983456118522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic event boundary detection aims to localize the generic, taxonomy-free
event boundaries that segment videos into chunks. Existing methods typically
require video frames to be decoded before being fed into the network, which
introduces significant spatio-temporal redundancy and demands considerable
computational power and storage space. To remedy these issues, we propose a
novel compressed video representation learning method for event boundary
detection that is fully end-to-end, leveraging rich information in the
compressed domain, i.e., RGB, motion vectors, residuals, and the internal group
of pictures (GOP) structure, without fully decoding the video. Specifically, we
use lightweight ConvNets to extract features of the P-frames in the GOPs, and a
spatial-channel attention module (SCAM) is designed to refine the feature
representations of the P-frames based on the compressed information with
bidirectional information flow. To learn a suitable representation for boundary
detection, we construct a local frames bag for each candidate frame and use a
long short-term memory (LSTM) module to capture temporal relationships. We
then compute frame differences with group similarities in the temporal domain.
This module is only applied within a local window, which is critical for event
boundary detection. Finally, a simple classifier is used to determine the event
boundaries of video sequences based on the learned feature representation. To
remedy the ambiguities of annotations and speed up the training process, we use
the Gaussian kernel to preprocess the ground-truth event boundaries. Extensive
experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that
the proposed method achieves considerable improvements over the previous
end-to-end approach while running at the same speed. The code is available at
https://github.com/GX77/LCVSL.
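The abstract compresses several design steps into a few sentences; the sketches below unpack three of them in PyTorch-style Python. First, the spatial-channel attention refinement. This is a minimal squeeze-and-excite-style gate, not the authors' exact SCAM: the channel width, reduction ratio, and the way compressed-domain cues (motion vectors, residuals) are fused with the P-frame features are all illustrative assumptions.

```python
# A speculative sketch of spatial-channel attention refinement; the real SCAM
# and its bidirectional information flow may differ substantially.
import torch
import torch.nn as nn

class SpatialChannelGate(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # channel gate: squeeze spatial dims, then excite per channel
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 8, 1), nn.ReLU(),
            nn.Conv2d(channels // 8, channels, 1), nn.Sigmoid())
        # spatial gate: a single-channel attention map over locations
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, p_feat, compressed_feat):
        # fuse P-frame features with compressed-domain cues (assumed here to be
        # motion-vector/residual features projected to the same shape)
        x = p_feat + compressed_feat
        x = x * self.channel_gate(x)     # reweight channels
        return x * self.spatial_gate(x)  # reweight spatial locations

refined = SpatialChannelGate()(torch.randn(1, 256, 14, 14),
                               torch.randn(1, 256, 14, 14))
print(refined.shape)  # torch.Size([1, 256, 14, 14])
```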
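Second, the local frames bag with an LSTM and group similarities. The sketch below gathers a padded window of neighbors around each candidate frame and classifies from the window's pairwise cosine-similarity map; the window size, bidirectional LSTM, and linear head are assumptions, and the released code at https://github.com/GX77/LCVSL is the authoritative reference.

```python
# A hedged sketch of per-frame boundary scoring from local-window similarities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalBoundaryHead(nn.Module):
    def __init__(self, dim=256, window=8):
        super().__init__()
        self.window = window
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        # classify each candidate frame from its flattened similarity map
        self.classifier = nn.Linear((2 * window + 1) ** 2, 1)

    def forward(self, feats):                  # feats: (B, T, dim)
        ctx, _ = self.lstm(feats)              # (B, T, 2*dim) temporal context
        # pad so every frame has a full bag of 2*window+1 neighbors
        pad = F.pad(ctx.transpose(1, 2), (self.window, self.window))
        bags = pad.transpose(1, 2).unfold(1, 2 * self.window + 1, 1)
        bags = F.normalize(bags.transpose(2, 3), dim=-1)   # (B, T, W, 2*dim)
        sim = bags @ bags.transpose(-1, -2)    # (B, T, W, W) group similarity
        logits = self.classifier(sim.flatten(2))           # (B, T, 1)
        return torch.sigmoid(logits).squeeze(-1)           # boundary scores

scores = LocalBoundaryHead()(torch.randn(2, 32, 256))
print(scores.shape)  # torch.Size([2, 32])
```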
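Finally, the Gaussian preprocessing of the ground-truth boundaries: hard 0/1 annotations are softened into bumps so that near-boundary frames receive partial credit, which the abstract credits with handling annotation ambiguity and speeding up training. The kernel width below is an arbitrary illustrative choice.

```python
# A minimal sketch of Gaussian-smoothed boundary labels; sigma is assumed.
import numpy as np

def gaussian_boundary_labels(boundaries, num_frames, sigma=1.0):
    """Place a Gaussian bump (peak 1.0) at each annotated boundary frame."""
    t = np.arange(num_frames)
    labels = np.zeros(num_frames)
    for b in boundaries:
        labels = np.maximum(labels, np.exp(-((t - b) ** 2) / (2 * sigma**2)))
    return labels

# Example: two boundaries in a 10-frame clip
print(gaussian_boundary_labels([3, 7], 10).round(2))
```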
Related papers
- ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization [62.751303924391564]
Effectively exploring spatial-temporal features is important for video colorization.
We develop a memory-based feature propagation module that can establish reliable connections with features from far-apart frames.
We develop a local attention module to aggregate the features from adjacent frames in a spatial-temporal neighborhood.
arXiv Detail & Related papers (2024-04-09T12:23:30Z)
- PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection [28.879484515844375]
We introduce both temporal and spatial information in a progressive way for an integrated enhancement.
PTSEFormer follows an end-to-end fashion to avoid heavy post-processing procedures while achieving 88.1% mAP on the ImageNet VID dataset.
arXiv Detail & Related papers (2022-09-06T06:32:57Z)
- Structured Context Transformer for Generic Event Boundary Detection [32.09242716244653]
We present Structured Context Transformer (or SC-Transformer) to solve the Generic Event Boundary Detection task.
We use the backbone convolutional neural network (CNN) to extract the features of each video frame.
A lightweight fully convolutional network is used to determine the event boundaries based on the grouped similarity maps.
arXiv Detail & Related papers (2022-06-07T03:00:24Z)
- End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection [31.31508043234419]
We propose a new end-to-end compressed video representation learning method for event boundary detection.
We first use the ConvNets to extract features of the I-frames in the GOPs.
After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames.
A temporal contrastive module is proposed to determine the event boundaries of video sequences.
arXiv Detail & Related papers (2022-03-29T08:27:48Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection [48.33132632418303]
Generic Boundary Detection (GBD) aims at locating general boundaries that divide videos into semantically coherent and taxonomy-free units.
Previous research handles these different-level generic boundaries separately, with specific designs of complicated deep networks ranging from simple CNNs to LSTMs.
We present Temporal Perceiver, a general architecture with Transformers, offering a unified solution to the detection of arbitrary generic boundaries.
arXiv Detail & Related papers (2022-03-01T09:31:30Z)
- Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
- BlockCopy: High-Resolution Video Processing with Block-Sparse Feature Propagation and Online Policies [57.62315799929681]
BlockCopy is a scheme that accelerates pretrained frame-based CNNs to process video more efficiently.
A lightweight policy network determines important regions in an image, and operations are applied on selected regions only.
Features of non-selected regions are simply copied from the preceding frame, reducing the number of computations and latency.
arXiv Detail & Related papers (2021-08-20T21:16:01Z)
- Temporal Modulation Network for Controllable Space-Time Video Super-Resolution [66.06549492893947]
Space-time video super-resolution (STVSR) aims to increase the spatial and temporal resolutions of low-resolution and low-frame-rate videos.
Deformable convolution-based methods have achieved promising STVSR performance, but they can only infer the intermediate frames pre-defined in the training stage.
We propose a Temporal Modulation Network (TMNet) to interpolate arbitrary intermediate frame(s) with accurate high-resolution reconstruction.
arXiv Detail & Related papers (2021-04-21T17:10:53Z)
- ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation [8.013823319651395]
ACDnet is a compact action detection network targeting real-time edge computing.
It exploits the temporal coherence between successive video frames to approximate CNN features rather than naively extracting them.
It can robustly achieve detection well above real-time (75 FPS).
arXiv Detail & Related papers (2021-02-26T14:06:31Z)