Action-Agnostic Point-Level Supervision for Temporal Action Detection
- URL: http://arxiv.org/abs/2412.21205v1
- Date: Mon, 30 Dec 2024 18:59:55 GMT
- Title: Action-Agnostic Point-Level Supervision for Temporal Action Detection
- Authors: Shuhei M. Yoshida, Takashi Shibata, Makoto Terao, Takayuki Okatani, Masashi Sugiyama
- Abstract summary: We propose action-agnostic point-level supervision for temporal action detection with a lightly annotated dataset.
In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories.
Unlike point-level supervision, which requires annotators to search for every action instance in an untrimmed video, frames to annotate are selected without human intervention.
- Score: 55.86569092972912
- Abstract: We propose action-agnostic point-level (AAPL) supervision for temporal action detection to achieve accurate action instance detection with a lightly annotated dataset. In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories. Unlike point-level supervision, which requires annotators to search for every action instance in an untrimmed video, frames to annotate are selected without human intervention in AAPL supervision. We also propose a detection model and learning method to effectively utilize the AAPL labels. Extensive experiments on a variety of datasets (THUMOS '14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between annotation cost and detection performance.
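The annotation scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: evenly spaced sampling is only one assumed choice of "unsupervised" frame selection, and the names `Segment`, `sample_timestamps`, and `simulate_annotation` are hypothetical. The human annotator is simulated by looking up ground-truth segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A ground-truth action instance (hypothetical structure for this sketch)."""
    start: float  # seconds
    end: float    # seconds
    label: str


def sample_timestamps(duration: float, interval: float) -> list[float]:
    """Pick timestamps without looking at any action (action-agnostic).

    Assumption: evenly spaced sampling, centered within each interval.
    The paper only states that frames are sampled in an unsupervised manner.
    """
    if duration <= 0 or interval <= 0:
        raise ValueError("duration and interval must be positive")
    timestamps = []
    k = 0
    while (k + 0.5) * interval < duration:
        timestamps.append((k + 0.5) * interval)
        k += 1
    return timestamps


def simulate_annotation(timestamps, segments):
    """Simulate the annotator: label each sampled frame with the action
    categories active at that time. An empty set means background."""
    labeled = []
    for t in timestamps:
        active = frozenset(s.label for s in segments if s.start <= t <= s.end)
        labeled.append((t, active))
    return labeled


if __name__ == "__main__":
    video_segments = [Segment(2.0, 5.0, "jump"), Segment(7.0, 8.0, "run")]
    points = sample_timestamps(duration=10.0, interval=3.0)
    for t, labels in simulate_annotation(points, video_segments):
        print(t, sorted(labels) or ["background"])
```

Note that, unlike conventional point-level labels, some sampled frames fall on background and some action instances may receive no labeled frame at all, which is why a dedicated learning method is needed to use such labels.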
Related papers
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce the human annotations to only one point for each object in a video frame during training, and obtain high-quality mask predictions close to fully supervised models.
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z)
- Proposal-based Temporal Action Localization with Point-level Supervision [29.98225940694062]
Point-level supervised temporal action localization (PTAL) aims at recognizing and localizing actions in untrimmed videos.
We propose a novel method that localizes actions by generating and evaluating action proposals of flexible duration.
Experiments show that our proposed method achieves competitive or superior performance to the state-of-the-art methods.
arXiv Detail & Related papers (2023-10-09T08:27:05Z)
- Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
- Active Learning with Effective Scoring Functions for Semi-Supervised Temporal Action Localization [15.031156121516211]
This paper focuses on a rarely investigated yet practical task named semi-supervised TAL.
We propose an effective active learning method, named AL-STAL.
Experimental results show that AL-STAL outperforms existing competitors and achieves satisfactory performance compared with fully supervised learning.
arXiv Detail & Related papers (2022-08-31T13:39:38Z)
- Active Pointly-Supervised Instance Segmentation [106.38955769817747]
We present an economic active learning setting, named active pointly-supervised instance segmentation (APIS).
APIS starts with box-level annotations and iteratively samples a point within the box and asks if it falls on the object.
The model developed with these strategies yields consistent performance gain on the challenging MS-COCO dataset.
arXiv Detail & Related papers (2022-07-23T11:25:24Z)
- End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple end-to-end approach that effectively utilizes unlabeled data.
Video action detection requires both action class prediction and temporal consistency.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
arXiv Detail & Related papers (2022-03-08T18:11:25Z)
- Self-supervised Pretraining with Classification Labels for Temporal Activity Detection [54.366236719520565]
Temporal Activity Detection aims to predict activity classes per frame.
Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited.
This work proposes a novel self-supervised pretraining method for detection leveraging classification labels.
arXiv Detail & Related papers (2021-11-26T18:59:28Z)
- Discovering Multi-Label Actor-Action Association in a Weakly Supervised Setting [22.86745487695168]
We propose a baseline based on multi-instance and multi-label learning.
We propose a novel approach that uses sets of actions as representation instead of modeling individual action classes.
We evaluate the proposed approach on a challenging dataset, where it outperforms the MIML baseline and is competitive with fully supervised approaches.
arXiv Detail & Related papers (2021-01-21T11:59:47Z)
- Temporal Action Detection with Multi-level Supervision [116.55596693897388]
We introduce the Semi-supervised Action Detection (SSAD) task with a mixture of labeled and unlabeled data.
We analyze different types of errors in the proposed SSAD baselines which are directly adapted from the semi-supervised classification task.
We incorporate weakly-labeled data into SSAD and propose Omni-supervised Action Detection (OSAD) with three levels of supervision.
arXiv Detail & Related papers (2020-11-24T04:45:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.