I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal
Mutual Distillation
- URL: http://arxiv.org/abs/2310.15568v1
- Date: Tue, 24 Oct 2023 07:22:17 GMT
- Title: I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal
Mutual Distillation
- Authors: Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang,
Houqiang Li
- Abstract summary: We introduce a general Inter- and Intra-modal Mutual Distillation (I$^2$MD) framework.
In I$^2$MD, we first re-formulate the cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process.
To alleviate the interference of similar samples and exploit their underlying contexts, we further design the Intra-modal Mutual Distillation (IMD) strategy.
- Score: 147.2183428328396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress on self-supervised 3D human action representation learning
is largely attributed to contrastive learning. However, in conventional
contrastive frameworks, the rich complementarity between different skeleton
modalities remains under-explored. Moreover, because they are optimized to
distinguish self-augmented samples, models struggle with the numerous similar
positive instances that arise when action categories are limited. In this work, we tackle the
aforementioned problems by introducing a general Inter- and Intra-modal Mutual
Distillation (I$^2$MD) framework. In I$^2$MD, we first re-formulate the
cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process.
Different from existing distillation solutions that transfer the knowledge of a
pre-trained and fixed teacher to the student, in CMD, the knowledge is
continuously updated and bidirectionally distilled between modalities during
pre-training. To alleviate the interference of similar samples and exploit
their underlying contexts, we further design the Intra-modal Mutual
Distillation (IMD) strategy. In IMD, the Dynamic Neighbors Aggregation (DNA)
mechanism is first introduced, where an additional cluster-level discrimination
branch is instantiated in each modality. It adaptively aggregates highly
correlated neighboring features to form local, cluster-level contrast.
Mutual distillation is then performed between the two branches for
cross-level knowledge exchange. Extensive experiments on three datasets show
that our approach sets a series of new records.
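To make the mechanisms above concrete, the following PyTorch sketch illustrates one plausible reading of the two distillation steps: the cross-modal part turns each modality's features into a similarity distribution over a shared set of anchor embeddings and distills the two distributions into each other with a KL term in each direction, while the intra-modal part softly pools each sample's most similar anchors into a cluster-level feature in the spirit of DNA. Every name, temperature, and the use of a memory bank of anchors is an illustrative assumption rather than the paper's exact formulation.

```python
# Minimal sketch of bidirectional (mutual) distillation between two skeleton
# modalities, plus a DNA-style neighbor aggregation. All shapes, temperatures,
# and the anchor/memory-bank setup are illustrative assumptions, not the
# authors' exact implementation.
import torch
import torch.nn.functional as F


def similarity_distribution(queries, anchors, temperature):
    """Softmax-normalized cosine similarities of each query to a shared anchor set."""
    q = F.normalize(queries, dim=-1)
    a = F.normalize(anchors, dim=-1)
    return F.softmax(q @ a.t() / temperature, dim=-1)


def mutual_distillation_loss(feat_a, feat_b, anchors_a, anchors_b,
                             t_teacher=0.05, t_student=0.1):
    """Bidirectional KL between the two modalities' similarity distributions.

    Each modality alternately acts as teacher (detached, sharper temperature)
    and student, so the distilled knowledge keeps being updated and flows both
    ways during pre-training instead of coming from a fixed, pre-trained teacher.
    """
    p_a_teacher = similarity_distribution(feat_a.detach(), anchors_a.detach(), t_teacher)
    p_b_teacher = similarity_distribution(feat_b.detach(), anchors_b.detach(), t_teacher)
    p_a_student = similarity_distribution(feat_a, anchors_a, t_student)
    p_b_student = similarity_distribution(feat_b, anchors_b, t_student)

    # KL(teacher || student) in both directions; the student side is passed as log-probs.
    loss_a_to_b = F.kl_div(p_b_student.log(), p_a_teacher, reduction="batchmean")
    loss_b_to_a = F.kl_div(p_a_student.log(), p_b_teacher, reduction="batchmean")
    return loss_a_to_b + loss_b_to_a


def aggregate_neighbors(features, bank, k=8, temperature=0.1):
    """DNA-style sketch: softly pool each sample's top-k most similar bank entries
    into a cluster-level feature for the intra-modal (cluster-level) branch."""
    sim = F.normalize(features, dim=-1) @ F.normalize(bank, dim=-1).t()
    topk_sim, topk_idx = sim.topk(k, dim=-1)
    weights = F.softmax(topk_sim / temperature, dim=-1)    # (N, k)
    neighbors = bank[topk_idx]                             # (N, k, D)
    return (weights.unsqueeze(-1) * neighbors).sum(dim=1)  # (N, D)


if __name__ == "__main__":
    # Toy usage: 32 samples, 128-d features per modality (e.g. joint and motion
    # streams), and a 256-entry anchor bank per modality.
    feat_joint, feat_motion = torch.randn(32, 128), torch.randn(32, 128)
    bank_joint, bank_motion = torch.randn(256, 128), torch.randn(256, 128)
    cmd_loss = mutual_distillation_loss(feat_joint, feat_motion, bank_joint, bank_motion)
    cluster_feat = aggregate_neighbors(feat_joint, bank_joint)
    print(cmd_loss.item(), cluster_feat.shape)
```

An intra-modal mutual distillation term could then be formed analogously, e.g. by applying the same KL-based loss between instance-level features and the cluster-level features produced by aggregate_neighbors within one modality; how the branches and losses are actually combined follows the paper.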
Related papers
- DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning [3.763772992906958]
Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch.
DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation) explicitly models different types of per-modality information.
arXiv Detail & Related papers (2024-08-05T13:44:15Z)
- Unified Molecular Modeling via Modality Blending [35.16755562674055]
We introduce a novel "blend-then-predict" self-supervised learning method (MoleBLEND)
MoleBLEND blends atom relations from different modalities into one unified relation for matrix encoding, then recovers modality-specific information for both 2D and 3D structures.
Experiments show that MoleBLEND achieves state-of-the-art performance across major 2D/3D benchmarks.
arXiv Detail & Related papers (2023-07-12T15:27:06Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis [15.623293264871181]
This study investigates the improvement approaches of modality representation with contrastive learning.
We devise a three-stages framework with multi-view contrastive learning to refine representations for the specific objectives.
We conduct experiments on three open datasets, and the results show the advantage of our model.
arXiv Detail & Related papers (2022-10-28T01:25:16Z)
- CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH)
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
- Unpaired Multi-modal Segmentation via Knowledge Distillation [77.39798870702174]
We propose a novel learning scheme for unpaired cross-modality image segmentation.
In our method, we heavily reuse network parameters, by sharing all convolutional kernels across CT and MRI.
We have extensively validated our approach on two multi-class segmentation problems.
arXiv Detail & Related papers (2020-01-06T20:03:17Z)