Learning an Augmented RGB Representation with Cross-Modal Knowledge
Distillation for Action Detection
- URL: http://arxiv.org/abs/2108.03619v1
- Date: Sun, 8 Aug 2021 12:04:14 GMT
- Title: Learning an Augmented RGB Representation with Cross-Modal Knowledge
Distillation for Action Detection
- Authors: Rui Dai, Srijan Das, Francois Bremond
- Abstract summary: Action detection requires not only categorizing actions, but also localizing them in untrimmed videos.
We propose a cross-modal knowledge distillation framework consisting of two levels of distillation.
Our proposed framework is generic and outperforms other popular cross-modal distillation methods in action detection task.
- Score: 7.616556723260849
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In video understanding, most cross-modal knowledge distillation (KD) methods
are tailored for classification tasks, focusing on the discriminative
representation of the trimmed videos. However, action detection requires not
only categorizing actions, but also localizing them in untrimmed videos.
Therefore, transferring knowledge pertaining to temporal relations is critical
for this task which is missing in the previous cross-modal KD frameworks. To
this end, we aim at learning an augmented RGB representation for action
detection, taking advantage of additional modalities at training time through
KD. We propose a KD framework consisting of two levels of distillation. On one
hand, atomic-level distillation encourages the RGB student to learn the
sub-representation of the actions from the teacher in a contrastive manner. On
the other hand, sequence-level distillation encourages the student to learn the
temporal knowledge from the teacher, which consists of transferring the Global
Contextual Relations and the Action Boundary Saliency. The result is an
Augmented-RGB stream that can achieve competitive performance as the two-stream
network while using only RGB at inference time. Extensive experimental analysis
shows that our proposed distillation framework is generic and outperforms other
popular cross-modal distillation methods in action detection task.
Related papers
- Object-centric Cross-modal Feature Distillation for Event-based Object
Detection [87.50272918262361]
RGB detectors still outperform event-based detectors due to sparsity of the event data and missing visual details.
We develop a novel knowledge distillation approach to shrink the performance gap between these two modalities.
We show that object-centric distillation allows to significantly improve the performance of the event-based student object detector.
arXiv Detail & Related papers (2023-11-09T16:33:08Z) - Decomposed Cross-modal Distillation for RGB-based Temporal Action
Detection [23.48709176879878]
Temporal action detection aims to predict the time intervals and the classes of action instances in the video.
Existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow.
We introduce a cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality.
arXiv Detail & Related papers (2023-03-30T10:47:26Z) - DETRDistill: A Universal Knowledge Distillation Framework for
DETR-families [11.9748352746424]
Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations.
Knowledge distillation (KD) can be employed to compress the huge model by constructing a universal teacher-student learning framework.
arXiv Detail & Related papers (2022-11-17T13:35:11Z) - Students taught by multimodal teachers are superior action recognizers [41.821485757189656]
The focal point of egocentric video understanding is modelling hand-object interactions.
Standard models -- CNNs, Vision Transformers, etc. -- which receive RGB frames as input perform well, however, their performance improves further by employing additional modalities such as object detections, optical flow, audio, etc.
The goal of this work is to retain the performance of such multimodal approaches, while using only the RGB images as input at inference time.
arXiv Detail & Related papers (2022-10-09T19:37:17Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel em modelname (bfem shortname) method dedicated for distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation with open-set semi-supervised learning (SSL)
Our shortname outperforms significantly previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - EvDistill: Asynchronous Events to End-task Learning via Bidirectional
Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called bfEvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
arXiv Detail & Related papers (2021-11-24T08:48:16Z) - Bridging the gap between Human Action Recognition and Online Action
Detection [0.456877715768796]
Action recognition, early prediction, and online action detection are complementary disciplines that are often studied independently.
We address the task-specific feature extraction with a teacher-student framework between the aforementioned disciplines.
Our network embeds online early prediction and online temporal segment proposalworks in parallel.
arXiv Detail & Related papers (2021-01-21T21:01:46Z) - Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z) - Collaborative Distillation in the Parameter and Spectrum Domains for
Video Action Recognition [79.60708268515293]
This paper explores how to train small and efficient networks for action recognition.
We propose two distillation strategies in the frequency domain, namely the feature spectrum and parameter distribution distillations respectively.
Our method can achieve higher performance than state-of-the-art methods with the same backbone.
arXiv Detail & Related papers (2020-09-15T07:29:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.