Background-Click Supervision for Temporal Action Localization
- URL: http://arxiv.org/abs/2111.12449v1
- Date: Wed, 24 Nov 2021 12:02:52 GMT
- Title: Background-Click Supervision for Temporal Action Localization
- Authors: Le Yang, Junwei Han, Tao Zhao, Tianwei Lin, Dingwen Zhang, Jianxin Chen
- Abstract summary: Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
- Score: 82.4203995101082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised temporal action localization aims at learning the
instance-level action pattern from the video-level labels, where a significant
challenge is action-context confusion. To overcome this challenge, one recent
work builds an action-click supervision framework. It requires similar
annotation costs but can steadily improve the localization performance when
compared to the conventional weakly supervised methods. In this paper, by
revealing that the performance bottleneck of the existing approaches mainly
comes from the background errors, we find that a stronger action localizer can
be trained with labels on the background video frames rather than those on the
action frames. To this end, we convert the action-click supervision to the
background-click supervision and develop a novel method, called BackTAL.
Specifically, BackTAL implements two-fold modeling on the background video
frames, i.e. the position modeling and the feature modeling. In position
modeling, we not only conduct supervised learning on the annotated video frames
but also design a score separation module to enlarge the score differences
between the potential action frames and backgrounds. In feature modeling, we
propose an affinity module to measure frame-specific similarities among
neighboring frames and dynamically attend to informative neighbors when
calculating temporal convolution. Extensive experiments on three benchmarks are
conducted, which demonstrate the high performance of the established BackTAL
and the rationality of the proposed background-click supervision. Code is
available at https://github.com/VividLe/BackTAL.
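The position modeling described in the abstract can be pictured with a short sketch. The snippet below is only an illustration of the idea, not the released BackTAL code: the function name background_click_losses, the margin hinge, and the top-k pooling are hypothetical choices, and the class activation sequence cas is assumed to carry one extra background column.

```python
import torch
import torch.nn.functional as F


def background_click_losses(cas, bg_click_mask, video_label, margin=1.0, topk=8):
    """Hypothetical position-modeling losses driven by background clicks.

    cas:           (T, C+1) class activation sequence; the last column is assumed
                   to be a dedicated background class.
    bg_click_mask: (T,) boolean mask marking the frames annotated as background.
    video_label:   index of the video-level action class.
    """
    bg_class = cas.size(1) - 1  # assumed background column

    # Supervised term: annotated background frames should be classified as background.
    bg_logits = cas[bg_click_mask]                                   # (N_bg, C+1)
    bg_targets = torch.full((bg_logits.size(0),), bg_class,
                            dtype=torch.long, device=cas.device)
    loss_click = F.cross_entropy(bg_logits, bg_targets)

    # Score-separation term: push the action score of likely action frames above
    # the action score of the clicked background frames by a margin.
    action_scores = cas[:, video_label]                              # (T,)
    k = max(1, min(topk, int((~bg_click_mask).sum())))
    fg_mean = action_scores[~bg_click_mask].topk(k).values.mean()
    bg_mean = action_scores[bg_click_mask].mean()
    loss_separation = F.relu(margin - (fg_mean - bg_mean))

    return loss_click, loss_separation
```

In practice such terms would be combined with the usual video-level classification loss of a weakly supervised localizer; the margin value and the top-k pooling here merely stand in for whatever pooling the paper actually uses.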
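The feature modeling (affinity module) can be sketched in the same spirit. Again, this is an assumed implementation rather than the authors' code: the class name AffinityTemporalConv, the cosine-similarity embedding, the window size, and the linear projection are illustrative stand-ins for the frame-specific affinities and the dynamically weighted temporal convolution described in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AffinityTemporalConv(nn.Module):
    """Hypothetical affinity-weighted temporal convolution (feature-modeling sketch)."""

    def __init__(self, in_dim, out_dim, window=3):
        super().__init__()
        assert window % 2 == 1, "odd window keeps each frame at the center"
        self.window = window
        self.embed = nn.Linear(in_dim, in_dim)           # embedding space for affinities
        self.proj = nn.Linear(in_dim * window, out_dim)  # shared projection over the window

    def forward(self, x):
        # x: (T, D) frame features of one video
        T, _ = x.shape
        pad = self.window // 2

        # Frame embeddings used to measure frame-specific similarity.
        emb = F.normalize(self.embed(x), dim=-1)                     # (T, D)

        # Local neighborhood of every frame, obtained by padding and unfolding.
        x_pad = F.pad(x, (0, 0, pad, pad))                           # (T + 2*pad, D)
        emb_pad = F.pad(emb, (0, 0, pad, pad))
        nbr_feat = x_pad.unfold(0, self.window, 1)                   # (T, D, window)
        nbr_emb = emb_pad.unfold(0, self.window, 1)                  # (T, D, window)

        # Cosine affinity of each frame to its neighbors, turned into attention weights.
        affinity = torch.einsum("td,tdw->tw", emb, nbr_emb)          # (T, window)
        attn = F.softmax(affinity, dim=-1)

        # Dynamically attend to informative neighbors before the temporal projection.
        weighted = nbr_feat * attn.unsqueeze(1)                      # (T, D, window)
        return self.proj(weighted.reshape(T, -1))                    # (T, out_dim)
```

A drop-in usage would be AffinityTemporalConv(2048, 512) applied to (T, 2048) snippet features, but the real feature dimensions and the module's placement inside the localization head would have to follow the released code.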
Related papers
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z) - Structured Attention Composition for Temporal Action Localization [99.66510088698051]
We study temporal action localization from the perspective of multi-modality feature learning.
Unlike conventional attention, the proposed module does not infer frame attention and modality attention independently.
The proposed structured attention composition module can be deployed as a plug-and-play module into existing action localization frameworks.
arXiv Detail & Related papers (2022-05-20T04:32:09Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - Retrieving and Highlighting Action with Spatiotemporal Reference [15.283548146322971]
We present a framework that jointly retrieves and temporally highlights actions in videos.
Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting.
arXiv Detail & Related papers (2020-05-19T03:12:31Z) - Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
arXiv Detail & Related papers (2020-03-27T14:02:56Z) - Action Localization through Continual Predictive Learning [14.582013761620738]
We present a new approach based on continual learning that uses feature-level predictions for self-supervision.
We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for the future frames.
This self-supervised framework is not as complicated as other approaches but is very effective in learning robust visual representations for both labeling and localization.
arXiv Detail & Related papers (2020-03-26T23:32:43Z) - SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)