Weakly Supervised Action Selection Learning in Video
- URL: http://arxiv.org/abs/2105.02439v1
- Date: Thu, 6 May 2021 04:39:29 GMT
- Title: Weakly Supervised Action Selection Learning in Video
- Authors: Junwei Ma, Satya Krishna Gorti, Maksims Volkovs, Guangwei Yu
- Abstract summary: Action Selection Learning is proposed to capture the general concept of action, a property we refer to as "actionness".
We show that ASL outperforms leading baselines on two popular benchmarks, THUMOS-14 and ActivityNet-1.2, with 10.3% and 5.7% relative improvement, respectively.
- Score: 8.337649176647645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localizing actions in video is a core task in computer vision. The weakly
supervised temporal localization problem investigates whether this task can be
adequately solved with only video-level labels, significantly reducing the
amount of expensive and error-prone annotation that is required. A common
approach is to train a frame-level classifier where frames with the highest
class probability are selected to make a video-level prediction. Frame-level
activations are then used for localization. However, the absence of frame-level
annotations causes the classifier to impart class bias on every frame. To
address this, we propose the Action Selection Learning (ASL) approach to
capture the general concept of action, a property we refer to as "actionness".
Under ASL, the model is trained with a novel class-agnostic task to predict
which frames will be selected by the classifier. Empirically, we show that ASL
outperforms leading baselines on two popular benchmarks, THUMOS-14 and
ActivityNet-1.2, with 10.3% and 5.7% relative improvement, respectively. We
further analyze the properties of ASL and demonstrate the importance of
actionness. Full code for this work is available here:
https://github.com/layer6ai-labs/ASL.
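To make the abstract's two ingredients concrete, here is a minimal PyTorch sketch of the common top-k aggregation for the video-level prediction and of a class-agnostic actionness head trained to predict which frames the classifier selects. All names, the top-k ratio, and the loss weighting are illustrative assumptions; the exact construction follows the paper and the linked repository.

```python
import torch
import torch.nn.functional as F

def weak_supervision_losses(class_logits, actionness_logits, video_label, k=8):
    """Minimal sketch of the ASL idea (illustrative, not the official code).

    class_logits:      (T, C) per-frame class scores from the classifier
    actionness_logits: (T,)   per-frame scores from a class-agnostic head
    video_label:       ()     long tensor, video-level class index
    k:                 number of frames aggregated into the video prediction
    """
    # Common approach: average the top-k frame scores per class to form a
    # video-level prediction, trained against the video-level label.
    video_logits = class_logits.topk(k, dim=0).values.mean(dim=0)  # (C,)
    cls_loss = F.cross_entropy(video_logits.unsqueeze(0), video_label.view(1))

    # ASL's class-agnostic task: the frames the classifier selects for the
    # labelled class become pseudo-positive targets for "actionness".
    selected = class_logits[:, video_label].topk(k).indices
    target = torch.zeros_like(actionness_logits)
    target[selected] = 1.0
    act_loss = F.binary_cross_entropy_with_logits(actionness_logits, target)
    return cls_loss + act_loss

# Example: 64 frames, 20 classes, video labelled with class 3.
# loss = weak_supervision_losses(torch.randn(64, 20), torch.randn(64), torch.tensor(3))
```

Because the actionness target is derived from whichever frames the classifier ranks highest for the labelled class, the head never sees class identity, which is what makes the task class-agnostic.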
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Zero-Shot Temporal Action Detection via Vision-Language Prompting [134.26292288193298]
We propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE).
Our model significantly outperforms state-of-the-art alternatives.
Our model also yields superior results on supervised TAD over recent strong competitors.
arXiv Detail & Related papers (2022-07-17T13:59:46Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from the two synthetic videos and maximize the agreement between them (see the sketch after this entry).
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Few-Shot Temporal Action Localization with Query Adaptive Transformer [105.84328176530303]
Existing TAL works rely on a large number of training videos with exhaustive segment-level annotation.
Few-shot TAL aims to adapt a model to a new class represented by as few as a single video.
arXiv Detail & Related papers (2021-10-20T13:18:01Z) - Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated (see the sketch after this entry).
arXiv Detail & Related papers (2020-03-27T14:02:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.