Boundary-Denoising for Video Activity Localization
- URL: http://arxiv.org/abs/2304.02934v1
- Date: Thu, 6 Apr 2023 08:48:01 GMT
- Title: Boundary-Denoising for Video Activity Localization
- Authors: Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel Pérez-Rúa, Bernard Ghanem
- Abstract summary: We study the video activity localization problem from a denoising perspective.
Specifically, we propose an encoder-decoder model named DenoiseLoc.
Experiments show that DenoiseLoc advances several video activity understanding tasks.
- Score: 57.9973253014712
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video activity localization aims at understanding the semantic content in
long untrimmed videos and retrieving actions of interest. The retrieved action
with its start and end locations can be used for highlight generation, temporal
action detection, etc. Unfortunately, learning the exact boundary location of
activities is highly challenging because temporal activities are continuous in
time, and there are often no clear-cut transitions between actions. Moreover,
the definition of the start and end of events is subjective, which may confuse
the model. To alleviate the boundary ambiguity, we propose to study the video
activity localization problem from a denoising perspective. Specifically, we
propose an encoder-decoder model named DenoiseLoc. During training, a set of
action spans is randomly generated from the ground truth with a controlled
noise scale. Then we attempt to reverse this process by boundary denoising,
allowing the localizer to predict activities with precise boundaries and
resulting in faster convergence speed. Experiments show that DenoiseLoc
advances several video activity understanding tasks. For example, we
observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64%
mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves
state-of-the-art performance on the TACoS and MAD datasets, while producing far
fewer predictions than other current methods.
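For intuition, the noising step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the (center, width) span parameterization, the function name, and the width-scaled jitter are all assumptions.

```python
import torch

def noisy_spans_from_gt(gt_spans: torch.Tensor,
                        noise_scale: float,
                        num_per_gt: int = 8) -> torch.Tensor:
    """Jitter ground-truth action spans with a controlled noise scale.

    gt_spans: (N, 2) tensor of normalized (center, width) pairs in [0, 1].
    Returns (N * num_per_gt, 2) noisy spans for a decoder to denoise.
    NOTE: the parameterization and scaling are illustrative assumptions.
    """
    spans = gt_spans.repeat_interleave(num_per_gt, dim=0)
    # noise_scale controls how far spans drift from the ground truth;
    # the jitter is scaled by span width so short and long actions are
    # perturbed proportionally.
    noise = torch.randn_like(spans) * noise_scale * spans[:, 1:2]
    noisy = spans + noise
    noisy[:, 0] = noisy[:, 0].clamp(0.0, 1.0)    # keep center inside the clip
    noisy[:, 1] = noisy[:, 1].clamp(1e-4, 1.0)   # keep width positive
    return noisy
```

During training, the decoder would take such noisy spans as queries and regress them back to the ground-truth boundaries, which is the denoising objective the abstract describes.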
Related papers
- FMI-TAL: Few-shot Multiple Instances Temporal Action Localization by Probability Distribution Learning and Interval Cluster Refinement [2.261014973523156]
We propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement.
This method can accurately identify the start and end boundaries of actions in the query video.
Our model achieves competitive performance in thorough experiments on the ActivityNet1.3 and THUMOS14 benchmarks.
arXiv Detail & Related papers (2024-08-25T08:17:25Z)
- Disentangle and denoise: Tackling context misalignment for video moment retrieval [16.939535169282262]
Video Moment Retrieval aims to locate in-context video moments according to a natural language query.
This paper proposes a cross-modal Context Denoising Network (CDNet) for accurate moment retrieval.
arXiv Detail & Related papers (2024-08-14T15:00:27Z)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking random temporal proposals as input, it accurately yields action proposals for an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
- FineAction: A Fine-Grained Video Dataset for Temporal Action Localization [60.90129329728657]
FineAction is a new large-scale fine-grained video dataset collected from existing video datasets and web videos.
This dataset contains 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories.
Experimental results reveal that FineAction brings new challenges for action localization on fine-grained, multi-label instances with shorter durations.
arXiv Detail & Related papers (2021-05-24T06:06:32Z)
- Boundary-sensitive Pre-training for Temporal Localization in Videos [124.40788524169668]
We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be carried out simply by classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to existing action-classification-based pre-training.
arXiv Detail & Related papers (2020-11-21T17:46:24Z)
- A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens Dataset [68.8204255655161]
Action recognition is one of the most challenging research fields in computer vision.
Sequences recorded under ego-motion have become especially relevant.
The proposed method copes with this by estimating the ego-motion, i.e., the camera motion, and compensating for it.
arXiv Detail & Related papers (2020-08-26T14:44:45Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and Charades datasets show promising performance in detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that requires only video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using standard backpropagation (a minimal sketch of such a joint objective appears after this list).
Our approach improves the state-of-the-art result on THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance on ActivityNet1.2.
arXiv Detail & Related papers (2020-01-21T22:01:17Z)
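The last entry above jointly optimizes a balanced binary cross-entropy loss and a metric loss. As a rough illustration only, a PyTorch sketch of such a joint objective might look like the following; the balancing scheme, the margin-based metric term, and all names are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(video_logits: torch.Tensor,   # (B, C) video-level logits
               video_labels: torch.Tensor,   # (B, C) multi-hot labels
               embeddings: torch.Tensor,     # (B, D) clip embeddings
               emb_labels: torch.Tensor,     # (B,)   class id per clip
               margin: float = 0.5,
               alpha: float = 1.0) -> torch.Tensor:
    # Balanced BCE: reweight positives by the negative/positive ratio so
    # sparse action labels are not drowned out (balancing scheme assumed).
    pos = video_labels.sum().clamp(min=1.0)
    neg = (video_labels.numel() - pos).clamp(min=1.0)
    bce = F.binary_cross_entropy_with_logits(
        video_logits, video_labels.float(), pos_weight=neg / pos)

    # Metric term: pull same-class embeddings together and push
    # different-class embeddings apart by at least `margin`.
    # Assumes the batch contains both same- and different-class pairs.
    dists = torch.cdist(embeddings, embeddings)                    # (B, B)
    same = emb_labels.unsqueeze(0) == emb_labels.unsqueeze(1)
    eye = torch.eye(len(emb_labels), dtype=torch.bool,
                    device=dists.device)
    metric = F.relu(dists[same & ~eye].mean()
                    - dists[~same].mean() + margin)

    return bce + alpha * metric
```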
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.