Decomposed Cross-modal Distillation for RGB-based Temporal Action
Detection
- URL: http://arxiv.org/abs/2303.17285v1
- Date: Thu, 30 Mar 2023 10:47:26 GMT
- Title: Decomposed Cross-modal Distillation for RGB-based Temporal Action
Detection
- Authors: Pilhyeon Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Hyeran Byun
- Abstract summary: Temporal action detection aims to predict the time intervals and the classes of action instances in the video.
Existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow.
We introduce a cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality.
- Score: 23.48709176879878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal action detection aims to predict the time intervals and the classes
of action instances in the video. Despite the promising performance, existing
two-stream models exhibit slow inference speed due to their reliance on
computationally expensive optical flow. In this paper, we introduce a
decomposed cross-modal distillation framework to build a strong RGB-based
detector by transferring knowledge of the motion modality. Specifically,
instead of direct distillation, we propose to separately learn RGB and motion
representations, which are in turn combined to perform action localization. The
dual-branch design and the asymmetric training objectives enable effective
motion knowledge transfer while preserving RGB information intact. In addition,
we introduce a local attentive fusion to better exploit the multimodal
complementarity. It is designed to preserve the local discriminability of the
features that is important for action localization. Extensive experiments on
the benchmarks verify the effectiveness of the proposed method in enhancing
RGB-based action detectors. Notably, our framework is agnostic to backbones and
detection heads, bringing consistent gains across different model combinations.
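
To make the decomposed design concrete, the following is a minimal PyTorch sketch of the idea stated above: two branches consume the same RGB snippet features, only the motion branch receives a distillation signal from a frozen optical-flow teacher, and a local attentive fusion combines the two streams for the detection head. All names (Branch, LocalAttentiveFusion, training_losses), shapes, and loss choices are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of decomposed cross-modal distillation for RGB-based detection.
# Assumed names, shapes, and losses; not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """Small temporal-conv branch over per-snippet RGB features of shape (B, C, T)."""
    def __init__(self, in_dim=2048, hid_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (B, D, T)

class LocalAttentiveFusion(nn.Module):
    """Attention computed over a small temporal window, so fusion keeps local discriminability."""
    def __init__(self, dim=512, window=3):
        super().__init__()
        self.gate = nn.Conv1d(2 * dim, dim, kernel_size=window, padding=window // 2)

    def forward(self, f_rgb, f_motion):
        a = torch.sigmoid(self.gate(torch.cat([f_rgb, f_motion], dim=1)))  # (B, D, T)
        return a * f_rgb + (1.0 - a) * f_motion

rgb_branch, motion_branch = Branch(), Branch()
fusion = LocalAttentiveFusion()

def training_losses(x_rgb, flow_teacher_feat, detection_loss_fn, targets):
    f_rgb = rgb_branch(x_rgb)        # keeps appearance information intact
    f_motion = motion_branch(x_rgb)  # learns motion knowledge from the teacher
    # Asymmetric objectives: only the motion branch is pulled toward the flow teacher.
    distill = F.mse_loss(f_motion, flow_teacher_feat.detach())
    fused = fusion(f_rgb, f_motion)  # any detection head can consume `fused`
    return detection_loss_fn(fused, targets) + distill

# Toy usage: 4 videos, 2048-d snippet features over 100 snippets; the flow teacher
# provides 512-d targets for the motion branch only. The head loss is a placeholder.
x_rgb = torch.randn(4, 2048, 100)
teacher_feat = torch.randn(4, 512, 100)
loss = training_losses(x_rgb, teacher_feat,
                       detection_loss_fn=lambda fused, tgt: fused.abs().mean(),
                       targets=None)
loss.backward()
```

The point of the sketch is the asymmetry: the distillation term touches only the motion branch, while the RGB branch is supervised solely through the fused detection loss, mirroring the abstract's claim of transferring motion knowledge while preserving RGB information.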
Related papers
- Object-centric Cross-modal Feature Distillation for Event-based Object
Detection [87.50272918262361]
RGB detectors still outperform event-based detectors due to sparsity of the event data and missing visual details.
We develop a novel knowledge distillation approach to shrink the performance gap between these two modalities.
We show that object-centric distillation significantly improves the performance of the event-based student object detector.
arXiv Detail & Related papers (2023-11-09T16:33:08Z) - Prior-enhanced Temporal Action Localization using Subject-aware Spatial
Attention [26.74864808534721]
Temporal action localization (TAL) aims to detect the boundary and identify the class of each action instance in a long untrimmed video.
Current approaches treat video frames homogeneously, and tend to give background and key objects excessive attention.
We propose a prior-enhanced temporal action localization method (PETAL), which only takes in RGB input and incorporates action subjects as priors.
arXiv Detail & Related papers (2022-11-10T02:27:30Z) - CIR-Net: Cross-modality Interaction and Refinement for RGB-D Salient
Object Detection [144.66411561224507]
We present a convolutional neural network (CNN) model, named CIR-Net, based on the novel cross-modality interaction and refinement.
Our network outperforms the state-of-the-art saliency detectors both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-10-06T11:59:19Z) - Spatiotemporal Multi-scale Bilateral Motion Network for Gait Recognition [3.1240043488226967]
Motivated by optical flow, this paper proposes bilateral motion-oriented features.
We develop a set of multi-scale temporal representations that force the motion context to be richly described at various levels of temporal resolution.
arXiv Detail & Related papers (2022-09-26T01:36:22Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based
Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multimodal spatiotemporal representation.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z) - BAANet: Learning Bi-directional Adaptive Attention Gates for
Multispectral Pedestrian Detection [14.672188805059744]
This work proposes an effective and efficient cross-modality fusion module called the Bi-directional Adaptive Attention Gate (BAA-Gate).
Based on the attention mechanism, the BAA-Gate is devised to distill the informative features and recalibrate the representations.
Extensive experiments on the challenging KAIST dataset demonstrate the superior performance of our method with satisfactory speed.
arXiv Detail & Related papers (2021-12-04T08:30:54Z) - Learning an Augmented RGB Representation with Cross-Modal Knowledge
Distillation for Action Detection [7.616556723260849]
Action detection requires not only categorizing actions, but also localizing them in untrimmed videos.
We propose a cross-modal knowledge distillation framework consisting of two levels of distillation.
Our proposed framework is generic and outperforms other popular cross-modal distillation methods in action detection task.
arXiv Detail & Related papers (2021-08-08T12:04:14Z) - PAN: Towards Fast Action Recognition via Learning Persistence of
Appearance [60.75488333935592]
Most state-of-the-art methods heavily rely on dense optical flow as motion representation.
In this paper, we shed light on fast action recognition by lifting the reliance on optical flow.
We design a novel motion cue called Persistence of Appearance (PA).
In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries.
arXiv Detail & Related papers (2020-08-08T07:09:54Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as a cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately (a generic sketch of this gated recalibration pattern follows this list).
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
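Several of the related papers above (e.g. the BAA-Gate and the Separation-and-Aggregation Gate works) share a recurring fusion pattern: each modality produces an attention gate that recalibrates the other before the two streams are aggregated. Below is a minimal, generic PyTorch sketch of that pattern; the module name, shapes, and the simple additive aggregation are assumptions for illustration and do not reproduce any specific paper.

```python
# Generic bi-directional gated recalibration for two modalities (illustrative only).
import torch
import torch.nn as nn

class BidirectionalGate(nn.Module):
    """Each modality predicts a channel-wise gate that recalibrates the other,
    then the two recalibrated feature maps are aggregated."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate_from_rgb = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                           nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_from_aux = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                           nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_aux):                     # f_*: (B, C, H, W)
        f_rgb_recal = f_rgb * self.gate_from_aux(f_aux)  # depth/thermal gates RGB
        f_aux_recal = f_aux * self.gate_from_rgb(f_rgb)  # RGB gates the other modality
        return f_rgb_recal + f_aux_recal                 # simple additive aggregation

fused = BidirectionalGate()(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
print(fused.shape)  # torch.Size([2, 256, 32, 32])
```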
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.