Learning to Localize Actions from Moments
- URL: http://arxiv.org/abs/2008.13705v1
- Date: Mon, 31 Aug 2020 16:03:47 GMT
- Title: Learning to Localize Actions from Moments
- Authors: Fuchen Long and Ting Yao and Zhaofan Qiu and Xinmei Tian and Jiebo Luo
and Tao Mei
- Abstract summary: We introduce a new type of transfer learning design to learn action localization for a large set of action categories.
We present Action Herald Networks (AherNet), which integrate this design into a one-stage action localization framework.
- Score: 153.54638582696128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the knowledge of action moments (i.e., trimmed video clips that each
contain an action instance), humans can routinely localize an action
temporally in an untrimmed video. Nevertheless, most practical methods still
require all training videos to be labeled with temporal annotations (action
category and temporal boundary) and develop the models in a fully-supervised
manner, despite the expensive labeling effort and the inapplicability to new
categories. In this paper, we introduce a new type of transfer learning design
to learn action localization for a large set of action categories, using only
action moments from the categories of interest and temporal annotations of
untrimmed videos from a small set of action classes. Specifically, we present
Action Herald Networks (AherNet), which integrate this design into a one-stage
action localization framework. Technically, a weight transfer function is
uniquely devised to build the transformation between the classification of
action moments or foreground video segments and action localization in
synthetic contextual moments or untrimmed videos. The context of each moment is
learned through an adversarial mechanism that differentiates the generated
features from those of the background in untrimmed videos. Extensive
experiments are conducted on learning both across the splits of ActivityNet
v1.3 and from THUMOS14 to ActivityNet v1.3. Our AherNet demonstrates superior
performance even when compared to most fully-supervised action localization
methods. More remarkably, we train AherNet to localize actions from 600
categories by leveraging action moments in Kinetics-600 and temporal
annotations from 200 classes in ActivityNet v1.3. Source code and data are
available at https://github.com/FuchenUSTC/AherNet.
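As a rough illustration of the abstract's two ingredients, the sketch below shows (i) a weight transfer function mapping per-category classification weights, learned from trimmed action moments, to localization weights applied to snippets of untrimmed videos, and (ii) an adversarially trained context generator whose synthetic context features are pushed to be indistinguishable from real background features. All module names, shapes, and hyper-parameters are illustrative assumptions written in PyTorch, not the authors' released implementation (see the repository linked above for that).

```python
# Illustrative sketch of the two components described in the AherNet abstract:
# (1) a weight transfer function from classification to localization weights,
# (2) adversarial learning of synthetic context around trimmed action moments.
# Names, shapes, and hyper-parameters are assumptions for illustration only.
import torch
import torch.nn as nn


class WeightTransfer(nn.Module):
    """Maps a category's classification weight vector to a localization weight vector."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, w_cls: torch.Tensor) -> torch.Tensor:  # (num_classes, dim)
        return self.mlp(w_cls)


class ContextGenerator(nn.Module):
    """Synthesizes context (background-like) features around a trimmed moment feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, moment_feat: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([moment_feat, noise], dim=-1))


class ContextDiscriminator(nn.Module):
    """Distinguishes real background features from generated context features."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)  # raw logits


# Toy usage: transfer classification weights of 600 moment categories to
# localization weights, then score snippet features of an untrimmed video.
dim, num_classes, num_snippets = 512, 600, 100
w_cls = torch.randn(num_classes, dim)       # classification weights from action moments
transfer = WeightTransfer(dim)
w_loc = transfer(w_cls)                     # transferred localization weights
snippets = torch.randn(num_snippets, dim)   # features of an untrimmed video
scores = snippets @ w_loc.t()               # (num_snippets, num_classes) localization scores

# Adversarial context learning (one generator/discriminator step, GAN-style).
gen, disc = ContextGenerator(dim), ContextDiscriminator(dim)
bce = nn.BCEWithLogitsLoss()
moment = torch.randn(8, dim)                # trimmed moment features
real_bg = torch.randn(8, dim)               # background features from untrimmed videos
fake_ctx = gen(moment, torch.randn(8, dim))
d_loss = bce(disc(real_bg), torch.ones(8, 1)) + bce(disc(fake_ctx.detach()), torch.zeros(8, 1))
g_loss = bce(disc(fake_ctx), torch.ones(8, 1))  # generator tries to fool the discriminator
```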
Related papers
- Video-Specific Query-Key Attention Modeling for Weakly-Supervised
Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
arXiv Detail & Related papers (2023-05-07T04:18:22Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
In videos, co-occurrence features often dominate the actual action content.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- Few-Shot Temporal Action Localization with Query Adaptive Transformer [105.84328176530303]
Existing temporal action localization (TAL) methods rely on a large number of training videos with exhaustive segment-level annotation.
Few-shot TAL aims to adapt a model to a new class represented by as few as a single video.
arXiv Detail & Related papers (2021-10-20T13:18:01Z)
- Few-Shot Action Localization without Knowing Boundaries [9.959844922120523]
We show that it is possible to learn to localize actions in untrimmed videos when only one/few trimmed examples of the target action are available at test time.
We propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos; a generic TSM sketch follows this entry.
Our method achieves performance comparable to or better than state-of-the-art fully-supervised, few-shot learning methods.
arXiv Detail & Related papers (2021-06-08T07:32:43Z)
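The entry above mentions Temporal Similarity Matrices (TSMs) between pairs of videos. Below is a minimal, generic sketch using cosine similarity between per-snippet features; the function name, shapes, and feature dimension are illustrative assumptions rather than that paper's exact formulation.

```python
# Generic sketch of a Temporal Similarity Matrix (TSM) between two videos:
# entry (i, j) is the cosine similarity between snippet i of a trimmed example
# and snippet j of an untrimmed query video. Shapes are illustrative only.
import torch
import torch.nn.functional as F


def temporal_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (Ta, D) snippet features of one video; b: (Tb, D) of another. Returns (Ta, Tb)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t()


trimmed = torch.randn(10, 512)      # few-shot trimmed example, 10 snippets
untrimmed = torch.randn(200, 512)   # untrimmed query video, 200 snippets
tsm = temporal_similarity_matrix(trimmed, untrimmed)  # (10, 200)
# Columns with high similarity suggest where the target action may occur in time.
```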
- FineAction: A Fine-Grained Video Dataset for Temporal Action Localization [60.90129329728657]
FineAction is a new large-scale fine-grained video dataset collected from existing video datasets and web videos.
This dataset contains 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories.
Experimental results reveal that our FineAction brings new challenges for action localization on fine-grained and multi-label instances with shorter duration.
arXiv Detail & Related papers (2021-05-24T06:06:32Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly-supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression.
Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
arXiv Detail & Related papers (2020-01-21T22:01:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.