Generic Event Boundary Detection in Video with Pyramid Features
- URL: http://arxiv.org/abs/2301.04288v1
- Date: Wed, 11 Jan 2023 03:29:27 GMT
- Title: Generic Event Boundary Detection in Video with Pyramid Features
- Authors: Van Thong Huynh, Hyung-Jeong Yang, Guee-Sang Lee, Soo-Hyung Kim
- Abstract summary: Generic event boundary detection (GEBD) aims to split a video into chunks at a broad and diverse set of action boundaries, mirroring how humans naturally perceive events.
We present an approach that considers the correlation between neighboring frames with pyramid feature maps in both spatial and temporal dimensions.
- Score: 12.896848011230523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generic event boundary detection (GEBD) aims to split a video into chunks at a
broad and diverse set of action boundaries, mirroring how humans naturally perceive events.
In this study, we present an approach that considers the correlation between
neighboring frames with pyramid feature maps in both spatial and temporal
dimensions to construct a framework for localizing generic events in video.
Features at multiple spatial scales of a pre-trained ResNet-50 are exploited
with different views in the temporal dimension to form a temporal pyramid
feature map. Based on that, the similarity between neighboring frames is
calculated and projected to build a temporal pyramid similarity feature vector.
A decoder with 1D convolution operations decodes these similarities into a new
representation that incorporates their temporal relationships for later
boundary score estimation. Extensive experiments on the GEBD benchmark dataset
show the effectiveness of our system and its variations, which outperform
state-of-the-art approaches. Additional experiments on the TAPOS dataset,
which contains long-form videos of Olympic sports actions, further demonstrate
the effectiveness of our approach.
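To make the pipeline above concrete, here is a minimal PyTorch sketch assuming pooled per-frame ResNet-50 features as input. The class name, temporal scales, neighbor window, and decoder layout are illustrative assumptions, not the authors' exact design:

```python
# Minimal sketch (not the authors' released code) of temporal pyramid
# similarity features with a 1D convolutional decoder for boundary scores.
# The class name, scales, window size, and decoder layout are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPyramidBoundary(nn.Module):
    def __init__(self, scales=(1, 2, 4), window=4, hidden=256):
        super().__init__()
        self.scales, self.window = scales, window
        in_dim = len(scales) * 2 * window  # one similarity per scale/offset/side
        self.decoder = nn.Sequential(      # 1D conv decoder over time
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, 1, kernel_size=1),  # per-frame boundary logit
        )

    def forward(self, feats):  # feats: (B, T, C) pooled per-frame features
        B, T, _ = feats.shape
        sims = []
        for s in self.scales:
            # Coarser temporal view: pool with stride s, resample back to T.
            v = F.avg_pool1d(feats.transpose(1, 2), s, stride=s, ceil_mode=True)
            v = F.interpolate(v, size=T, mode="linear", align_corners=False)
            v = F.normalize(v.transpose(1, 2), dim=-1)  # unit-norm (B, T, C)
            for off in range(1, self.window + 1):
                # Cosine similarity with frames `off` steps to either side
                # (torch.roll wraps at clip ends -- a simplification here).
                sims.append((v * torch.roll(v, off, dims=1)).sum(-1))
                sims.append((v * torch.roll(v, -off, dims=1)).sum(-1))
        x = torch.stack(sims, dim=1)        # (B, len(scales)*2*window, T)
        return self.decoder(x).squeeze(1)   # (B, T) boundary logits

# Usage with random stand-in features for a 64-frame clip:
model = TemporalPyramidBoundary()
scores = torch.sigmoid(model(torch.randn(2, 64, 2048)))  # (2, 64) in [0, 1]
```

Stacking similarities from several temporal strides gives the decoder both fine-grained and coarse views of how appearance changes around each frame, which is the intuition behind the temporal pyramid described in the abstract.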
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Mumpy: Multilateral Temporal-view Pyramid Transformer for Video Inpainting Detection [41.4800103693756]
We introduce a novel Multilateral Temporal-view Pyramid Transformer (MumPy) that flexibly coordinates spatial-temporal clues.
Our method utilizes a newly designed multilateral temporal-view to extract various collaborations of spatial-temporal clues and introduces a deformable window-based temporal-view interaction module.
By adjusting the contribution strength of spatial and temporal clues, our method can effectively identify inpainted regions.
arXiv Detail & Related papers (2024-04-17T03:56:28Z)
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- Temporal Action Localization with Enhanced Instant Discriminability [66.76095239972094]
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video.
We propose a one-stage framework named TriDet to resolve imprecise predictions of action boundaries by existing methods.
Experimental results demonstrate the robustness of TriDet and its state-of-the-art performance on multiple TAD datasets.
arXiv Detail & Related papers (2023-09-11T16:17:50Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Structured Context Transformer for Generic Event Boundary Detection [32.09242716244653]
We present Structured Context Transformer (or SC-Transformer) to solve the Generic Event Boundary Detection task.
We use the backbone convolutional neural network (CNN) to extract the features of each video frame.
Grouped similarity maps are computed from these features, and a lightweight fully convolutional network determines the event boundaries based on them.
arXiv Detail & Related papers (2022-06-07T03:00:24Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
The proposed Correlation and Topology Learning (CTL) framework utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion (a sketch of one such summary appears after this list).
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet exploit the long-range spatial and temporal context interdependencies of these features and their spatial-temporal correlations.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
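As a concrete illustration of the pretext task in the "Uncovering Spatio-temporal Statistics" entry above, here is a minimal sketch that derives the spatial location and dominant direction of the largest motion from raw frames. It uses naive frame differencing on a block grid as a crude stand-in; the function name, grid size, and direction binning are assumptions, not the paper's exact procedure:

```python
# Hypothetical sketch of one pretext label: grid-block location and
# direction bin of the largest motion in a clip, via frame differencing.
import torch
import torch.nn.functional as F

def largest_motion_summary(frames, grid=3, bins=8):
    """frames: (T, H, W) grayscale clip -> (block_index, direction_bin)."""
    diff = (frames[1:] - frames[:-1]).abs().mean(0, keepdim=True)  # (1, H, W)
    # Motion energy per cell of a grid x grid partition of the frame.
    energy = F.adaptive_avg_pool2d(diff.unsqueeze(0), grid).flatten()
    block = int(energy.argmax())  # grid cell with the most motion
    # Crude direction proxy: orientation of the mean gradient of the
    # difference map (an optical-flow estimate would be more faithful).
    gy = (diff[:, 1:, :] - diff[:, :-1, :]).mean()
    gx = (diff[:, :, 1:] - diff[:, :, :-1]).mean()
    angle = torch.atan2(gy, gx)  # radians in (-pi, pi]
    direction = int(((angle + torch.pi) / (2 * torch.pi) * bins) % bins)
    return block, direction  # labels a network is trained to predict

# Usage: pretext labels for a random 16-frame clip.
block, direction = largest_motion_summary(torch.rand(16, 112, 112))
```

In the self-supervised setting, summaries like these are computed offline and a network is trained to regress them from the raw frames, forcing it to learn motion-sensitive representations without manual annotation.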