COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action
Spotting using Transformers
- URL: http://arxiv.org/abs/2309.01270v2
- Date: Thu, 26 Oct 2023 09:58:37 GMT
- Title: COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action
Spotting using Transformers
- Authors: Julien Denize, Mykola Liashuha, Jaonary Rabarisoa, Astrid Orcesi,
Romain Hérault
- Abstract summary: We present COMEDIAN, a novel pipeline to initialize transformers for action spotting.
Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
- Score: 1.894259749028573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present COMEDIAN, a novel pipeline to initialize spatiotemporal
transformers for action spotting, which involves self-supervised learning and
knowledge distillation. Action spotting is a timestamp-level temporal action
detection task. Our pipeline consists of three steps, with two initialization
stages. First, we perform self-supervised initialization of a spatial
transformer using short videos as input. Second, we initialize a temporal
transformer that enhances the spatial transformer's outputs with global context
through knowledge distillation from a pre-computed feature bank aligned with
each short video segment. In the final step, we fine-tune the transformers to
the action spotting task. The experiments, conducted on the SoccerNet-v2
dataset, demonstrate state-of-the-art performance and validate the
effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several
advantages of our pretraining pipeline, including improved performance and
faster convergence compared to non-pretrained models.
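
To make the three-step recipe above concrete, here is a minimal PyTorch-style sketch of the pipeline. It is illustrative only: the module names (SpatialTransformer, TemporalTransformer), the InfoNCE-style and MSE losses, the feature-bank interface, and all sizes are assumptions rather than the authors' implementation.

# Illustrative sketch of the three-step COMEDIAN-style pipeline (assumptions, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Hypothetical per-frame encoder, standing in for the spatial transformer."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)  # toy backbone over 32x32 frames
    def forward(self, clip):                     # clip: (B, T, 3, 32, 32)
        return self.proj(clip.flatten(2))        # (B, T, dim) frame tokens

class TemporalTransformer(nn.Module):
    """Hypothetical temporal aggregator over the frame tokens."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
    def forward(self, tokens):                   # (B, T, dim) -> (B, T, dim)
        return self.encoder(tokens)

def step1_selfsup_loss(spatial, clips_a, clips_b):
    # Step 1: self-supervised init of the spatial transformer on short clips;
    # a generic InfoNCE-style loss between two augmented views is used here as
    # a placeholder for the paper's self-supervised objective.
    za = F.normalize(spatial(clips_a).mean(1), dim=-1)
    zb = F.normalize(spatial(clips_b).mean(1), dim=-1)
    logits = za @ zb.t() / 0.1
    return F.cross_entropy(logits, torch.arange(za.size(0)))

def step2_distill_loss(spatial, temporal, clips, bank_features):
    # Step 2: initialize the temporal transformer by distilling toward a
    # pre-computed feature bank aligned with each short clip (MSE placeholder).
    tokens = temporal(spatial(clips))            # (B, T, dim)
    return F.mse_loss(tokens.mean(1), bank_features)

def step3_spotting_loss(spatial, temporal, head, clips, labels):
    # Step 3: fine-tune end-to-end on action spotting, i.e. per-timestamp
    # classification (background class included).
    logits = head(temporal(spatial(clips)))      # (B, T, num_classes)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

if __name__ == "__main__":
    spatial, temporal = SpatialTransformer(), TemporalTransformer()
    head = nn.Linear(256, 18)                    # hypothetical: 17 SoccerNet-v2 actions + background
    clips = torch.randn(2, 8, 3, 32, 32)
    print(step1_selfsup_loss(spatial, clips, clips).item())
    print(step2_distill_loss(spatial, temporal, clips, torch.randn(2, 256)).item())
    print(step3_spotting_loss(spatial, temporal, head, clips,
                              torch.randint(0, 18, (2, 8))).item())
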
Related papers
- Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario (a toy sketch of this weight-copy-plus-LoRA idea appears after this list).
arXiv Detail & Related papers (2024-04-14T18:57:38Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, which is the first time this has been demonstrated.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - An Efficient Spatio-Temporal Pyramid Transformer for Action Detection [40.68615998427292]
We present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) video framework for action detection.
Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention to capture long-term space-time dependencies in the later stages (a toy sketch of this local-then-global staging appears after this list).
In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2022-07-21T12:38:05Z) - Continual Transformers: Redundancy-Free Attention for Online Inference [86.3361797111839]
We propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference in a continual input stream.
Our modification is purely to the order of computations, while the produced outputs and learned weights are identical to those of the original Multi-Head Attention.
arXiv Detail & Related papers (2022-01-17T08:20:09Z) - Efficient Two-Stage Detection of Human-Object Interactions with a Novel
Unary-Pairwise Transformer [41.44769642537572]
Unary-Pairwise Transformer is a two-stage detector that exploits unary and pairwise representations for HOIs.
We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches.
arXiv Detail & Related papers (2021-12-03T10:52:06Z) - An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
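
The "Weight Copy and Low-Rank Adaptation" entry above copies weights from intermittent teacher layers into a shallower student and then distills with LoRA. Below is a minimal, generic sketch of that idea; the layer-selection stride, the LoRA rank and placement, and the MSE distillation loss are assumptions, not the paper's recipe.

# Illustrative sketch of weight copy + LoRA-style few-shot distillation (assumptions throughout).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_student(teacher_layers: nn.ModuleList, stride: int = 2) -> nn.ModuleList:
    """Copy every `stride`-th (intermittent) teacher layer into a shallower student."""
    return nn.ModuleList(copy.deepcopy(layer) for layer in teacher_layers[::stride])

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update W + B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
    def forward(self, x):
        return self.base(x) + F.linear(x, self.B @ self.A)

def distill_step(student, teacher, x):
    """Few-shot distillation: match student features to frozen teacher features."""
    with torch.no_grad():
        t = x
        for layer in teacher:
            t = layer(t)
    s = x
    for layer in student:
        s = layer(s)
    return F.mse_loss(s, t)

if __name__ == "__main__":
    teacher = nn.ModuleList(nn.Linear(64, 64) for _ in range(8))  # toy stand-in for transformer blocks
    student = build_student(teacher, stride=2)                    # 4 copied layers
    student = nn.ModuleList(LoRALinear(layer) for layer in student)
    print(distill_step(student, teacher, torch.randn(16, 64)).item())
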
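Similarly, the STPT entry describes local window attention in the early stages and global attention in the later stages. The sketch below illustrates that staging on a 1-D token sequence; the window size, depth, and dimensions are assumptions, and residual connections and MLPs are omitted for brevity.

# Illustrative sketch of local-then-global attention staging (assumptions, not STPT's configuration).
import torch
import torch.nn as nn

class WindowBlock(nn.Module):
    """Self-attention restricted to non-overlapping windows of `window` tokens."""
    def __init__(self, dim=128, heads=4, window=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):                        # x: (B, T, dim), T divisible by window
        b, t, d = x.shape
        w = x.reshape(b * t // self.window, self.window, d)
        out, _ = self.attn(w, w, w)
        return out.reshape(b, t, d)

class GlobalBlock(nn.Module):
    """Standard self-attention over the full token sequence."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class LocalThenGlobal(nn.Module):
    """Early stages: cheap local windows; later stages: long-range dependencies."""
    def __init__(self, dim=128):
        super().__init__()
        self.stages = nn.Sequential(WindowBlock(dim), WindowBlock(dim),
                                    GlobalBlock(dim), GlobalBlock(dim))
    def forward(self, x):
        return self.stages(x)

if __name__ == "__main__":
    tokens = torch.randn(2, 64, 128)             # (batch, time, channels)
    print(LocalThenGlobal()(tokens).shape)       # torch.Size([2, 64, 128])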