Discovering Multi-Label Actor-Action Association in a Weakly Supervised
Setting
- URL: http://arxiv.org/abs/2101.08567v1
- Date: Thu, 21 Jan 2021 11:59:47 GMT
- Title: Discovering Multi-Label Actor-Action Association in a Weakly Supervised
Setting
- Authors: Sovan Biswas and Juergen Gall
- Abstract summary: We propose a baseline based on multi-instance and multi-label learning.
We propose a novel approach that uses sets of actions as representation instead of modeling individual action classes.
We evaluate the proposed approach on the challenging AVA dataset, where it outperforms the MIML baseline and is competitive with fully supervised approaches.
- Score: 22.86745487695168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since collecting and annotating data for spatio-temporal action detection is
very expensive, there is a need to learn approaches with less supervision.
Weakly supervised approaches do not require any bounding box annotations and
can be trained only from labels that indicate whether an action occurs in a
video clip. Current approaches, however, cannot handle the case when there are
multiple persons in a video that perform multiple actions at the same time. In
this work, we address this very challenging task for the first time. We propose
a baseline based on multi-instance and multi-label learning. Furthermore, we
propose a novel approach that uses sets of actions as representation instead of
modeling individual action classes. Since computing the probabilities for the
full power set becomes intractable as the number of action classes increases,
we assign an action set to each detected person under the constraint that the
assignment is consistent with the annotation of the video clip. We evaluate the
proposed approach on the challenging AVA dataset where the proposed approach
outperforms the MIML baseline and is competitive with fully supervised
approaches.
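The two ideas in the abstract, pooling per-person predictions into clip-level predictions (the MIML setting) and assigning each detected person an action set that stays consistent with the clip annotation, can be illustrated with a small sketch. This is not the authors' implementation: the function names `miml_clip_scores` and `assign_action_sets`, the max pooling, the 0.5 threshold, and the greedy fallback are all assumptions chosen for illustration, and "consistent" is interpreted here as "the union of the assigned sets equals the clip labels".

```python
from typing import Dict, FrozenSet, List, Set

# One {action: probability} dict per detected person (hypothetical input format).
Scores = List[Dict[str, float]]


def miml_clip_scores(person_scores: Scores, actions: List[str]) -> Dict[str, float]:
    """MIML-style pooling: a clip shows an action if at least one person performs it.

    Max pooling over persons is an assumed choice, not taken from the paper.
    """
    return {a: max((p.get(a, 0.0) for p in person_scores), default=0.0) for a in actions}


def assign_action_sets(
    person_scores: Scores, clip_labels: Set[str], threshold: float = 0.5
) -> List[FrozenSet[str]]:
    """Give each person a subset of the clip labels whose union equals the clip labels."""
    if not person_scores:
        return []
    # Step 1: each person keeps the annotated actions scored at or above the threshold.
    assigned = [
        {a for a in clip_labels if p.get(a, 0.0) >= threshold} for p in person_scores
    ]
    # Step 2: every annotated action must be explained by at least one person,
    # so an uncovered action goes to the person who scores it highest.
    for action in sorted(clip_labels):
        if not any(action in s for s in assigned):
            best = max(
                range(len(person_scores)),
                key=lambda i: person_scores[i].get(action, 0.0),
            )
            assigned[best].add(action)
    return [frozenset(s) for s in assigned]


if __name__ == "__main__":
    scores = [
        {"stand": 0.9, "talk": 0.2, "listen": 0.4},
        {"stand": 0.1, "sit": 0.8, "talk": 0.3, "listen": 0.45},
    ]
    print(miml_clip_scores(scores, ["stand", "sit", "talk", "listen"]))
    print(assign_action_sets(scores, {"stand", "sit", "listen"}))
```

Restricting each person to subsets of the clip labels avoids scoring the full power set of action classes, which is the intractability the abstract mentions; actions absent from the annotation (here `talk`) are never assigned.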
Related papers
- FMI-TAL: Few-shot Multiple Instances Temporal Action Localization by Probability Distribution Learning and Interval Cluster Refinement [2.261014973523156]
We propose a novel solution involving a spatial-channel relation transformer with probability learning and cluster refinement.
This method can accurately identify the start and end boundaries of actions in the query video.
Our model achieves competitive performance through meticulous experimentation utilizing the benchmark datasets ActivityNet1.3 and THUMOS14.
arXiv Detail & Related papers (2024-08-25T08:17:25Z)
- Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting [87.11995635760108]
Key to action counting is accurately locating each video's repetitive actions.
We propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner.
arXiv Detail & Related papers (2024-06-13T05:15:52Z)
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z)
- The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset that includes a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z)
- One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features [2.8266810371534152]
Open-vocabulary Temporal Action Detection (Open-vocab TAD) is an advanced video analysis approach.
The proposed method achieves superior results compared to the other methods in both Open-vocab and Closed-vocab settings.
arXiv Detail & Related papers (2024-04-30T13:14:28Z)
- Weakly Supervised Video Individual Counting [126.75545291243142]
Video Individual Counting aims to predict the number of unique individuals in a single video.
We introduce a weakly supervised VIC task, wherein trajectory labels are not provided.
In doing so, we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow, outflow, and the remaining.
arXiv Detail & Related papers (2023-12-10T16:12:13Z)
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points [28.607690605262878]
Temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label.
In this paper, we focus on the task of multi-label temporal action detection that aims to localize all action instances from a multi-label untrimmed video.
We extend the sparse query-based detection paradigm from the traditional TAD and propose the multi-label TAD framework of PointTAD.
arXiv Detail & Related papers (2022-10-20T06:08:03Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
We propose a Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates an attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z)
- Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos [82.02074241700728]
In this paper, we present a spatio-temporal action recognition model that is trained with only video-level labels.
Our method leverages per-frame person detectors, which have been trained on large image datasets, within a Multiple Instance Learning framework.
We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid.
arXiv Detail & Related papers (2020-07-21T10:45:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.