Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal
Action Localization
- URL: http://arxiv.org/abs/2305.17861v1
- Date: Mon, 29 May 2023 02:48:04 GMT
- Title: Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal
Action Localization
- Authors: Huan Ren, Wenfei Yang, Tianzhu Zhang, Yongdong Zhang
- Abstract summary: Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
- Score: 98.66318678030491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised temporal action localization aims to localize and recognize
actions in untrimmed videos with only video-level category labels during
training. Without instance-level annotations, most existing methods follow the
Segment-based Multiple Instance Learning (S-MIL) framework, where the
predictions of segments are supervised by the labels of videos. However, the
objective for acquiring segment-level scores during training is not consistent
with the target for acquiring proposal-level scores during testing, leading to
suboptimal results. To deal with this problem, we propose a novel
Proposal-based Multiple Instance Learning (P-MIL) framework that directly
classifies the candidate proposals in both the training and testing stages,
which includes three key designs: 1) a surrounding contrastive feature
extraction module to suppress the discriminative short proposals by considering
the surrounding contrastive information, 2) a proposal completeness evaluation
module to inhibit the low-quality proposals with the guidance of the
completeness pseudo labels, and 3) an instance-level rank consistency loss to
achieve robust detection by leveraging the complementarity of RGB and FLOW
modalities. Extensive experimental results on two challenging benchmarks
including THUMOS14 and ActivityNet demonstrate the superior performance of our
method.
Related papers
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce the human annotations to only one point for each object in a video frame during training, and obtain high-quality mask predictions close to fully supervised models.
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z) - Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z) - Leveraging triplet loss for unsupervised action segmentation [0.0]
We propose a fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself.
Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions.
Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches.
arXiv Detail & Related papers (2023-04-13T11:10:16Z) - Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario.
We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z) - Active Learning with Effective Scoring Functions for Semi-Supervised
Temporal Action Localization [15.031156121516211]
This paper focuses on a rarely investigated yet practical task named semi-supervised TAL.
We propose an effective active learning method, named AL-STAL.
Experiment results show that AL-STAL outperforms the existing competitors and achieves satisfying performance compared with fully-supervised learning.
arXiv Detail & Related papers (2022-08-31T13:39:38Z) - Few-shot Action Recognition with Prototype-centered Attentive Learning [88.10852114988829]
Prototype-centered Attentive Learning (PAL) model composed of two novel components.
First, a prototype-centered contrastive learning loss is introduced to complement the conventional query-centered learning objective.
Second, PAL integrates a attentive hybrid learning mechanism that can minimize the negative impacts of outliers.
arXiv Detail & Related papers (2021-01-20T11:48:12Z) - Weakly-Supervised Multi-Level Attentional Reconstruction Network for
Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.
Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.
We present an effective weakly-supervised model, named as Multi-Level Attentional Reconstruction Network (MARN), which only relies on video-sentence pairs during the training stage.
arXiv Detail & Related papers (2020-03-16T07:01:01Z) - Proposal Learning for Semi-Supervised Object Detection [76.83284279733722]
It is non-trivial to train object detectors on unlabeled data due to the unavailability of ground truth labels.
We present a proposal learning approach to learn proposal features and predictions from both labeled and unlabeled data.
arXiv Detail & Related papers (2020-01-15T00:06:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.