Weakly-supervised Temporal Action Localization by Uncertainty Modeling
- URL: http://arxiv.org/abs/2006.07006v3
- Date: Thu, 17 Dec 2020 07:12:38 GMT
- Title: Weakly-supervised Temporal Action Localization by Uncertainty Modeling
- Authors: Pilhyeon Lee, Jinglu Wang, Yan Lu, Hyeran Byun
- Abstract summary: Weakly-supervised temporal action localization aims to learn to detect temporal intervals of action classes with only video-level labels.
We present a new perspective on background frames where they are modeled as out-of-distribution samples regarding their inconsistency.
- Score: 34.27514534497615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised temporal action localization aims to learn to detect temporal intervals of action classes with only video-level labels. To this end,
it is crucial to separate frames of action classes from the background frames
(i.e., frames not belonging to any action classes). In this paper, we present a
new perspective on background frames where they are modeled as
out-of-distribution samples regarding their inconsistency. Then, background
frames can be detected by estimating the probability of each frame being
out-of-distribution, known as uncertainty, but it is infeasible to directly
learn uncertainty without frame-level labels. To realize the uncertainty
learning in the weakly-supervised setting, we leverage the multiple instance
learning formulation. Moreover, we further introduce a background entropy loss
to better discriminate background frames by encouraging their in-distribution
(action) probabilities to be uniformly distributed over all action classes.
Experimental results show that our uncertainty modeling is effective at
alleviating the interference of background frames and brings a large
performance gain without bells and whistles. We demonstrate that our model
significantly outperforms state-of-the-art methods on the benchmarks, THUMOS'14
and ActivityNet (1.2 & 1.3). Our code is available at
https://github.com/Pilhyeon/WTAL-Uncertainty-Modeling.
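The abstract describes three ingredients that can be combined in a standard weakly-supervised pipeline: (1) multiple instance learning over segment-level class scores with only a video-level label, (2) a per-segment uncertainty signal that treats background segments as out-of-distribution, and (3) a background entropy loss that pushes the action probabilities of background segments toward a uniform distribution. The fragment below is a minimal, illustrative PyTorch sketch of how these pieces can fit together; the top-k pooling, the use of feature magnitude as an inverse-uncertainty signal, and the loss weight are assumptions made here for illustration and are not taken from the abstract. The authors' actual implementation is in the repository linked above.

import torch
import torch.nn.functional as F

def video_level_scores(cas, k):
    # MIL-style pooling: average the top-k segment scores per class to get
    # video-level class scores from the class activation sequence (T x C).
    return torch.topk(cas, k=k, dim=0).values.mean(dim=0)

def magnitude(features):
    # Per-segment feature-vector magnitude, used here as an inverse-uncertainty
    # signal: large magnitude ~ in-distribution (action), small ~ background.
    return features.norm(p=2, dim=-1)

def background_entropy_loss(cas, bkg_idx):
    # Push the action probabilities of pseudo-background segments toward a
    # uniform distribution by minimizing their negative entropy.
    probs = F.softmax(cas[bkg_idx], dim=-1)
    return (probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

# Toy example: T segments, C action classes, D-dim features, one positive class.
T, C, D, k = 100, 20, 2048, 8
features = torch.randn(T, D)
cas = torch.randn(T, C)                  # class activation sequence from a classifier head
video_label = torch.zeros(C)
video_label[3] = 1.0

# (1) MIL: video-level classification loss from pooled segment scores.
cls_loss = F.binary_cross_entropy_with_logits(video_level_scores(cas, k), video_label)

# (2) Uncertainty: select pseudo action / background segments by magnitude.
mag = magnitude(features)
act_idx = mag.topk(k).indices            # most confident action segments
bkg_idx = (-mag).topk(k).indices         # most confident background segments

# (3) Background entropy loss on the pseudo-background segments.
bkg_loss = background_entropy_loss(cas, bkg_idx)

total_loss = cls_loss + 0.1 * bkg_loss   # loss weight chosen arbitrarily here

In a full method the pseudo-action indices (act_idx) would also be used, for example to encourage large feature magnitudes on action segments and small ones on background segments; the fragment only shows how the MIL pooling and the background entropy term are computed.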
Related papers
- ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition [52.537021302246664]
Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance).
We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes.
We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% absolute on HMDB51.
arXiv Detail & Related papers (2025-01-31T20:47:06Z)
- Improving Training and Inference of Face Recognition Models via Random Temperature Scaling [45.33976405587231]
Random Temperature Scaling (RTS) is proposed to learn a reliable face recognition algorithm.
RTS can achieve top performance on both the face recognition and out-of-distribution detection tasks.
The proposed module is light-weight and only adds negligible cost to the model.
arXiv Detail & Related papers (2022-12-02T08:00:03Z)
- Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but can steadily improve the localization performance when compared to the conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Tracking the risk of a deployed model and detecting harmful distribution shifts [105.27463615756733]
In practice, it may make sense to ignore benign shifts, under which the performance of a deployed model does not degrade substantially.
We argue that a sensible method for firing off a warning has to both (a) detect harmful shifts while ignoring benign ones, and (b) allow continuous monitoring of model performance without increasing the false alarm rate.
arXiv Detail & Related papers (2021-10-12T17:21:41Z)
- A Low Rank Promoting Prior for Unsupervised Contrastive Learning [108.91406719395417]
We construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning.
Our hypothesis explicitly requires that all samples belonging to the same instance class lie in the same low-dimensional subspace.
Empirical evidence shows that the proposed algorithm clearly surpasses state-of-the-art approaches on multiple benchmarks.
arXiv Detail & Related papers (2021-08-05T15:58:25Z)
- Semi-supervised Facial Action Unit Intensity Estimation with Contrastive Learning [54.90704746573636]
Our method does not require manually selecting key frames, and produces state-of-the-art results with as little as 2% of annotated frames.
We experimentally validate that our method outperforms existing methods when working with as little as 2% of randomly chosen data.
arXiv Detail & Related papers (2020-11-03T17:35:57Z)
- Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos [82.02074241700728]
In this paper, we present a spatio-temporal action recognition model that is trained with only video-level labels.
Our method leverages per-frame person detectors that have been trained on large image datasets, within a Multiple Instance Learning framework.
We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid.
arXiv Detail & Related papers (2020-07-21T10:45:05Z)
- Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, action and non-action frames are well separated (a rough sketch of this mechanism follows after the list).
arXiv Detail & Related papers (2020-03-27T14:02:56Z)
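For the generative attention modeling entry in the list above, the fragment below sketches the general mechanism its summary describes: a conditional Variational Auto-Encoder over frame features, conditioned on a per-frame attention value, whose conditional likelihood can then be maximized with respect to the attention to separate action from non-action frames. All class names, layer sizes, and the simple negative-ELBO objective are assumptions made here for illustration only; this is not the cited paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameCVAE(nn.Module):
    # Conditional VAE over frame features x, conditioned on a scalar attention value.
    def __init__(self, feat_dim=2048, latent_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim + 1, 512), nn.ReLU())   # q(z | x, attention)
        self.mu = nn.Linear(512, latent_dim)
        self.logvar = nn.Linear(512, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim + 1, 512), nn.ReLU(),  # p(x | z, attention)
                                 nn.Linear(512, feat_dim))

    def forward(self, x, attn):
        h = self.enc(torch.cat([x, attn], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(torch.cat([z, attn], dim=-1)), mu, logvar

def neg_elbo(x, x_rec, mu, logvar):
    # Negative evidence lower bound: reconstruction error plus a KL term
    # (averaged over latent dimensions) against a standard normal prior.
    rec = F.mse_loss(x_rec, x)
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).mean()
    return rec + kl

# Toy usage: T frames, D-dim features, per-frame attention in [0, 1].
T, D = 100, 2048
x, attn = torch.randn(T, D), torch.rand(T, 1)
model = FrameCVAE(feat_dim=D)
x_rec, mu, logvar = model(x, attn)
loss = neg_elbo(x, x_rec, mu, logvar)
# In the setting described above, the attention values themselves would also be
# optimized to maximize the conditional likelihood, which is what separates
# action frames from non-action frames.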
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.