End-to-End Compressed Video Representation Learning for Generic Event
Boundary Detection
- URL: http://arxiv.org/abs/2203.15336v1
- Date: Tue, 29 Mar 2022 08:27:48 GMT
- Title: End-to-End Compressed Video Representation Learning for Generic Event
Boundary Detection
- Authors: Congcong Li, Xinyao Wang, Longyin Wen, Dexiang Hong, Tiejian Luo, Libo
Zhang
- Abstract summary: We propose a new end-to-end compressed video representation learning method for event boundary detection.
We first use ConvNets to extract features of the I-frames in the GOPs.
After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames.
A temporal contrastive module is proposed to determine the event boundaries of video sequences.
- Score: 31.31508043234419
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generic event boundary detection aims to localize the generic, taxonomy-free
event boundaries that segment videos into chunks. Existing methods typically
require video frames to be decoded before feeding into the network, which
demands considerable computational power and storage space. To address this, we
propose a new end-to-end compressed video representation learning method for event
boundary detection that leverages the rich information in the compressed
domain, i.e., RGB, motion vectors, residuals, and the internal group of
pictures (GOP) structure, without fully decoding the video. Specifically, we
first use ConvNets to extract features of the I-frames in the GOPs. After
that, a light-weight spatial-channel compressed encoder is designed to compute
the feature representations of the P-frames based on the motion vectors,
residuals and representations of their dependent I-frames. A temporal
contrastive module is proposed to determine the event boundaries of video
sequences. To remedy the ambiguities of annotations and speed up the training
process, we use the Gaussian kernel to preprocess the ground-truth event
boundaries. Extensive experiments conducted on the Kinetics-GEBD dataset
demonstrate that the proposed method achieves comparable results to the
state-of-the-art methods with $4.5\times$ faster running speed.
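Since the pipeline is only described in prose above, here is a minimal PyTorch sketch of its two most concrete steps, under stated assumptions: the spatial-channel compressed encoder is reduced to warping the dependent I-frame feature with the motion vectors and fusing in a residual branch, and the Gaussian preprocessing of ground-truth boundaries is shown as label smoothing along time. All names (PFrameEncoder, smooth_boundaries) and layer sizes are illustrative, not taken from the paper.
```python
# A minimal sketch, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFrameEncoder(nn.Module):
    """Approximate a P-frame feature without decoding the P-frame to RGB."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.res_branch = nn.Sequential(      # light-weight residual branch
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 1)  # channel fusion

    def forward(self, i_feat, mv, residual):
        # i_feat:   (B, C, H, W) feature of the dependent I-frame
        # mv:       (B, 2, H, W) motion vectors (x, y) at feature resolution
        # residual: (B, 3, H, W) residual signal read from the bitstream
        _, _, H, W = i_feat.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        gx = 2 * (xs[None].to(mv) + mv[:, 0]) / (W - 1) - 1   # normalized x
        gy = 2 * (ys[None].to(mv) + mv[:, 1]) / (H - 1) - 1   # normalized y
        grid = torch.stack((gx, gy), dim=-1)                  # (B, H, W, 2)
        warped = F.grid_sample(i_feat, grid, align_corners=True)  # motion warp
        return self.fuse(torch.cat([warped, self.res_branch(residual)], dim=1))

def smooth_boundaries(labels: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # labels: (T,) binary ground-truth boundary indicators for T frames.
    # Spreading each boundary with a Gaussian tolerates annotation ambiguity
    # and gives a denser training signal; sigma is a free design choice.
    t = torch.arange(labels.numel(), dtype=torch.float32)
    centers = t[labels.bool()]
    if centers.numel() == 0:
        return torch.zeros_like(t)
    g = torch.exp(-(t[:, None] - centers[None, :]) ** 2 / (2 * sigma**2))
    return g.amax(dim=1)     # keep the strongest nearby boundary, capped at 1
```
A plausible training loop would interleave the I-frame features with PFrameEncoder outputs along time, pass them to the temporal contrastive module, and regress the smoothed targets; that module is not sketched here, since the abstract gives no detail about it.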
Related papers
- Spatial Decomposition and Temporal Fusion based Inter Prediction for
Learned Video Compression [59.632286735304156]
We propose a spatial decomposition and temporal fusion based inter prediction for learned video compression.
With the SDD-based motion model and long short-term temporal fusion, our proposed learned video codec can obtain more accurate inter prediction contexts.
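The summary names an SDD-based motion model and long short-term temporal fusion without further detail. As a rough illustration only, the module below gates between a short-term (previous-frame) feature and a long-term (distant-reference) feature; it is a generic stand-in, not the paper's design, and all names and sizes are assumptions.
```python
# Generic stand-in for long short-term temporal fusion (an assumption,
# not the paper's architecture): a learned gate blends the two contexts.
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, short_feat: torch.Tensor, long_feat: torch.Tensor):
        # short_feat: features from the previous frame (short-term context)
        # long_feat:  features from a more distant reference (long-term)
        g = self.gate(torch.cat([short_feat, long_feat], dim=1))
        return g * short_feat + (1.0 - g) * long_feat  # per-pixel blend
```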
arXiv Detail & Related papers (2024-01-29T03:30:21Z)
- Local Compressed Video Stream Learning for Generic Event Boundary
Detection [25.37983456118522]
Event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
Existing methods typically require video frames to be decoded before feeding into the network.
We propose a novel, fully end-to-end event boundary detection method that leverages the rich information in the compressed domain.
arXiv Detail & Related papers (2023-09-27T06:49:40Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using
a New Frame Selection Policy and Gating Mechanism [8.395400675921515]
Gated-ViGAT is an efficient approach for video event recognition.
It uses bottom-up (object) information, a new frame sampling policy and a gating mechanism.
Gated-ViGAT provides a large reduction in computational complexity compared to our previous approach.
arXiv Detail & Related papers (2023-01-18T14:36:22Z)
- Structured Context Transformer for Generic Event Boundary Detection [32.09242716244653]
We present Structured Context Transformer (or SC-Transformer) to solve the Generic Event Boundary Detection task.
We use a backbone convolutional neural network (CNN) to extract the features of each video frame.
A lightweight fully convolutional network is used to determine the event boundaries based on the grouped similarity maps.
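The grouped similarity maps are only named here; the sketch below shows the general idea under assumptions (cosine similarity between backbone features, a toy two-layer convolutional scorer, and reading the first off-diagonal as the boundary signal are all illustrative choices, not the paper's exact design).
```python
# Hedged sketch: score event boundaries from a frame-to-frame similarity map.
import torch
import torch.nn as nn
import torch.nn.functional as F

def similarity_map(feats: torch.Tensor) -> torch.Tensor:
    # feats: (T, C) per-frame features from a backbone CNN.
    f = F.normalize(feats, dim=1)
    return f @ f.t()                      # (T, T) cosine similarities

scorer = nn.Sequential(                   # lightweight fully convolutional head
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

feats = torch.randn(64, 256)              # e.g. 64 frames, 256-d features
sim = similarity_map(feats)[None, None]   # (1, 1, T, T) input "image"
evidence = scorer(sim)[0, 0]              # (T, T) refined similarity evidence
boundary_signal = 1 - evidence.diagonal(offset=1)  # low neighbor similarity
                                                   # suggests a boundary
```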
arXiv Detail & Related papers (2022-06-07T03:00:24Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- Temporal Perceiver: A General Architecture for Arbitrary Boundary
Detection [48.33132632418303]
Generic Boundary Detection (GBD) aims at locating general boundaries that divide videos into semantically coherent and taxonomy-free units.
Previous research handles these different-level generic boundaries separately, with specifically designed deep networks ranging from simple CNNs to LSTMs.
We present Temporal Perceiver, a general architecture with Transformers, offering a unified solution to the detection of arbitrary generic boundaries.
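The summary gives only "a general architecture with Transformers"; reading "Perceiver" as latent queries that cross-attend to frame features, a minimal sketch might look as follows. The DETR-style per-query output (a boundary position plus a confidence) is an assumption, as are all names and sizes.
```python
# Hedged sketch: latent queries cross-attend to frame features; each query
# proposes one boundary. The decoding scheme is an assumption, not the paper's.
import torch
import torch.nn as nn

class LatentBoundaryDetector(nn.Module):
    def __init__(self, dim: int = 256, n_latents: int = 32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 2)     # (position, confidence) per query

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) features of a possibly long video
        q = self.latents.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        ctx, _ = self.cross_attn(q, frame_feats, frame_feats)
        return torch.sigmoid(self.head(ctx))  # (B, n_latents, 2) in [0, 1]
```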
arXiv Detail & Related papers (2022-03-01T09:31:30Z)
- Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields visually more pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
- TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial
Decoding [22.12530692711095]
Video compression reduces superfluous information by representing the raw video stream using the concept of a Group of Pictures (GOP).
In this work, we introduce GOP-level sampling, drawing the network's input from partially decoded videos.
We demonstrate the superior performance of TEAM-Net compared to the baseline using RGB only.
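Concretely, GOP-level sampling can be pictured as indexing the stream by GOPs and, per GOP, decoding only the I-frame while reading the P-frames' compressed-domain signals. The helper below is illustrative only; it assumes a fixed GOP size and ignores B-frames, which real streams may contain.
```python
# Illustrative GOP-level sampling for partial decoding (fixed GOP size and
# no B-frames are simplifying assumptions; real streams can vary).
def gop_samples(num_frames: int, gop_size: int = 12):
    gops = []
    for start in range(0, num_frames, gop_size):
        end = min(start + gop_size, num_frames)
        gops.append({
            "i_frame": start,                         # decoded to RGB
            "p_frames": list(range(start + 1, end)),  # motion vectors/residuals
        })
    return gops

print(gop_samples(30))
# (abridged) [{'i_frame': 0, 'p_frames': [1, ..., 11]},
#             {'i_frame': 12, 'p_frames': [13, ..., 23]},
#             {'i_frame': 24, 'p_frames': [25, ..., 29]}]
```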
arXiv Detail & Related papers (2021-10-17T12:56:03Z)
- BlockCopy: High-Resolution Video Processing with Block-Sparse Feature
Propagation and Online Policies [57.62315799929681]
BlockCopy is a scheme that accelerates pretrained frame-based CNNs to process video more efficiently.
A lightweight policy network determines important regions in an image, and operations are applied on selected regions only.
Features of non-selected regions are simply copied from the preceding frame, reducing the number of computations and latency.
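As a sketch of the copy step only (the policy network and the truly block-sparse execution are omitted; frame_feat_fn stands in for any pretrained frame-based CNN): selected blocks are recomputed on the new frame, everything else is carried over from the previous frame's features.
```python
# Hedged sketch of BlockCopy-style feature reuse. For clarity this computes
# the full feature map and then masks it; the actual method executes the
# network block-sparsely, which is where the savings come from.
import torch

def blockcopy_step(frame_feat_fn, frame, prev_feat, policy_mask, block=32):
    # policy_mask: (H/block, W/block) booleans from a lightweight policy net.
    new_feat = prev_feat.clone()           # start from the previous features
    full = frame_feat_fn(frame)            # (B, C, H, W) features of new frame
    for i in range(policy_mask.shape[0]):
        for j in range(policy_mask.shape[1]):
            if policy_mask[i, j]:          # recompute only "important" regions
                ys = slice(i * block, (i + 1) * block)
                xs = slice(j * block, (j + 1) * block)
                new_feat[..., ys, xs] = full[..., ys, xs]
    return new_feat

# Toy usage with an identity "network" on a 2x2 block grid:
feat_fn = lambda x: x
prev = torch.zeros(1, 3, 64, 64)
cur = torch.ones(1, 3, 64, 64)
mask = torch.tensor([[True, False], [False, False]])
out = blockcopy_step(feat_fn, cur, prev, mask)  # only top-left block updates
```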
arXiv Detail & Related papers (2021-08-20T21:16:01Z)