Self-supervised Pretraining with Classification Labels for Temporal
Activity Detection
- URL: http://arxiv.org/abs/2111.13675v1
- Date: Fri, 26 Nov 2021 18:59:28 GMT
- Title: Self-supervised Pretraining with Classification Labels for Temporal
Activity Detection
- Authors: Kumara Kahatapitiya, Zhou Ren, Haoxiang Li, Zhenyu Wu and Michael S.
Ryoo
- Abstract summary: Temporal Activity Detection aims to predict activity classes per frame.
Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited.
This work proposes a novel self-supervised pretraining method for detection leveraging classification labels.
- Score: 54.366236719520565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal Activity Detection aims to predict activity classes per frame, in
contrast to video-level predictions as done in Activity Classification (i.e.,
Activity Recognition). Due to the expensive frame-level annotations required
for detection, the scale of detection datasets is limited. Thus, commonly,
previous work on temporal activity detection resorts to fine-tuning a
classification model pretrained on large-scale classification datasets (e.g.,
Kinetics-400). However, such pretrained models are not ideal for downstream
detection performance due to the disparity between the pretraining and the
downstream fine-tuning tasks. This work proposes a novel self-supervised
pretraining method for detection leveraging classification labels to mitigate
such disparity by introducing frame-level pseudo labels, multi-action frames,
and action segments. We show that the models pretrained with the proposed
self-supervised detection task outperform prior work on multiple challenging
activity detection benchmarks, including Charades and MultiTHUMOS. Our
extensive ablations further provide insights on when and how to use the
proposed models for activity detection. Code and models will be released
online.
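The central idea in the abstract, turning video-level classification labels into frame-level detection supervision by splicing clips into multi-action sequences with per-frame pseudo labels, can be illustrated with a minimal sketch. This is a rough illustration of the general idea, not the authors' exact procedure; the function and parameter names are hypothetical.

```python
import numpy as np

def make_multi_action_sequence(clips, labels, seg_len=8, rng=None):
    """Splice segments from classification clips into one synthetic
    'detection' sequence with frame-level pseudo labels.

    clips  : list of arrays, each (T_i, D) of frame features
    labels : list of video-level class ids, one per clip
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    frames, frame_labels = [], []
    for clip, label in zip(clips, labels):
        # sample a random contiguous action segment from each clip
        start = int(rng.integers(0, max(1, len(clip) - seg_len + 1)))
        seg = clip[start:start + seg_len]
        frames.append(seg)
        # broadcast the video-level label to every frame of the segment
        frame_labels.append(np.full(len(seg), label))
    # the concatenation mimics an untrimmed video with multiple actions
    return np.concatenate(frames), np.concatenate(frame_labels)
```

A model pretrained to predict `frame_labels` from the spliced features faces a per-frame prediction task much closer to detection than whole-video classification is.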
Related papers
- Investigating Self-Supervised Methods for Label-Efficient Learning [27.029542823306866]
We study different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, evaluating their low-shot capabilities.
We introduce a framework combining masked image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks.
When testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
arXiv Detail & Related papers (2024-06-25T10:56:03Z)
- Aligned Unsupervised Pretraining of Object Detectors with Self-training [41.03780087924593]
Unsupervised pretraining of object detectors has recently become a key component of object detector training.
We propose a framework that mitigates the misalignment between pretraining and downstream detection, consisting of three simple yet key ingredients.
We show that our strategy is also capable of pretraining from scratch (including the backbone) and works on complex images like COCO.
arXiv Detail & Related papers (2023-07-28T17:46:00Z)
- Label-Efficient Object Detection via Region Proposal Network Pre-Training [58.50615557874024]
We propose a simple pretext task that provides effective pre-training for the region proposal network (RPN).
In comparison with multi-stage detectors without RPN pre-training, our approach is able to consistently improve downstream task performance.
arXiv Detail & Related papers (2022-11-16T16:28:18Z)
- ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
- Cluster & Tune: Boost Cold Start Performance in Text Classification [21.957605438780224]
In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce.
We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task.
arXiv Detail & Related papers (2022-03-20T15:29:34Z)
- DAP: Detection-Aware Pre-training with Weak Supervision [37.336674323981285]
This paper presents a detection-aware pre-training (DAP) approach for object detection tasks.
We transform a classification dataset into a detection dataset through a weakly supervised object localization method based on Class Activation Maps.
We show that DAP can outperform the traditional classification pre-training in terms of both sample efficiency and convergence speed in downstream detection tasks including VOC and COCO.
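The weakly supervised localization step described above relies on Class Activation Maps (CAM), a standard technique: final convolutional feature maps are weighted by the classifier weights for a class, and the activated region is thresholded into a rough box. The sketch below illustrates generic CAM-based box extraction, not DAP's exact pipeline; names are illustrative.

```python
import numpy as np

def cam_to_box(feature_maps, class_weights, class_id, thresh=0.5):
    """Turn a Class Activation Map into a rough bounding box.

    feature_maps  : (C, H, W) final conv features for one image
    class_weights : (num_classes, C) classifier weights after global
                    average pooling
    """
    # CAM: weight each feature channel by the classifier weight
    # for the target class, then sum over channels -> (H, W) map
    cam = np.tensordot(class_weights[class_id], feature_maps, axes=1)
    # normalize the map to [0, 1] for a comparable threshold
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    ys, xs = np.where(cam >= thresh)
    if len(xs) == 0:
        return None
    # tight box around the activated region (feature-map coordinates)
    return xs.min(), ys.min(), xs.max(), ys.max()
```

Each box obtained this way can be paired with the image's classification label to form a pseudo detection annotation.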
arXiv Detail & Related papers (2021-03-30T19:48:30Z)
- Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax [88.11979569564427]
We provide the first systematic analysis of the underperformance of state-of-the-art models under long-tail data distributions.
We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training.
Extensive experiments on the recent large-vocabulary long-tail object recognition benchmark LVIS show that the proposed BAGS module significantly improves detector performance.
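The group-wise training at the heart of BAGS computes a softmax separately within each group of classes (grouped by training-instance count), so frequent classes cannot suppress rare ones in a single shared normalization. A minimal numpy illustration of the group-wise softmax follows; the actual BAGS module also handles a background/"others" category per group, which is omitted here.

```python
import numpy as np

def group_softmax_probs(logits, groups):
    """Compute softmax separately within each class group.

    logits : (num_classes,) raw classifier scores
    groups : list of index arrays partitioning the classes,
             e.g. grouped by per-class training-instance count
    """
    probs = np.zeros(logits.shape, dtype=float)
    for idx in groups:
        g = logits[idx]
        e = np.exp(g - g.max())      # numerically stable softmax
        probs[idx] = e / e.sum()     # normalize only within the group
    return probs
```

Because normalization never crosses group boundaries, a rare class competes only against classes of similar frequency.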
arXiv Detail & Related papers (2020-06-18T10:24:26Z)
- Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression.
Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.