Versatile Multi-Modal Pre-Training for Human-Centric Perception
- URL: http://arxiv.org/abs/2203.13815v1
- Date: Fri, 25 Mar 2022 17:58:29 GMT
- Title: Versatile Multi-Modal Pre-Training for Human-Centric Perception
- Authors: Fangzhou Hong, Liang Pan, Zhongang Cai, Ziwei Liu
- Abstract summary: We propose the Human-Centric Multi-Modal Contrastive Learning framework HCMoCo for effective representation learning.
We design Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning targets that hierarchically learn a modal-invariant latent space.
Experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo.
- Score: 32.62404509079062
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human-centric perception plays a vital role in vision and graphics, but its
data annotations are prohibitively expensive. It is therefore desirable to have a
versatile pre-trained model that serves as a foundation for data-efficient transfer to
downstream tasks. To this end, we propose HCMoCo, a Human-Centric Multi-Modal
Contrastive Learning framework that leverages the multi-modal nature of human data
(e.g. RGB, depth, 2D keypoints) for effective representation learning. This objective
poses two main challenges: dense pre-training on multi-modality data and efficient
usage of sparse human priors. To tackle them, we design novel Dense Intra-sample
Contrastive Learning and Sparse Structure-aware Contrastive Learning targets that
hierarchically learn a modal-invariant latent space featuring continuous and ordinal
feature distributions and structure-aware semantic consistency. HCMoCo provides
pre-training for different modalities by combining heterogeneous datasets, which
allows efficient usage of existing task-specific human data. Extensive experiments on
four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo,
especially under data-efficient settings (7.16% and 12% improvements on DensePose
Estimation and Human Parsing, respectively). Moreover, we demonstrate the versatility
of HCMoCo by exploring cross-modality supervision and missing-modality inference,
validating its strong capability for cross-modal association and reasoning.
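The dense intra-sample objective described above pairs spatially aligned features from different modalities of the same sample. Below is a minimal, hypothetical sketch of such a dense contrastive (InfoNCE-style) loss, assuming pixel-aligned feature maps from two modality encoders (e.g. RGB and depth); the function name, tensor shapes, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch (not HCMoCo's exact objective): a dense intra-sample
# InfoNCE loss between pixel-aligned features of two modalities of one sample.
import torch
import torch.nn.functional as F

def dense_intra_sample_infonce(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (B, C, H, W) spatially aligned feature maps
    produced by two modality-specific encoders (assumed, for illustration)."""
    B, C, H, W = feat_a.shape
    # Flatten spatial locations and L2-normalize the per-location features.
    a = F.normalize(feat_a.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    b = F.normalize(feat_b.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    # Within each sample, a location in modality A should match the same
    # location in modality B and contrast against every other location.
    logits = torch.bmm(a, b.transpose(1, 2)) / temperature      # (B, HW, HW)
    targets = torch.arange(H * W, device=feat_a.device).expand(B, -1)
    return F.cross_entropy(logits.reshape(B * H * W, H * W), targets.reshape(-1))

# Example with random features standing in for encoder outputs:
# loss = dense_intra_sample_infonce(torch.randn(2, 64, 16, 12),
#                                   torch.randn(2, 64, 16, 12))
```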
Related papers
- Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL achieves superior performance compared to state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z) - Transferable Unsupervised Outlier Detection Framework for Human Semantic Trajectories [9.816270572121724]
We propose the Transferable Outlier Detection for Human Semantic Trajectories (TOD4Traj) framework.
TOD4Traj first introduces a modality feature unification module to align diverse data feature representations.
A contrastive learning module is further proposed for identifying regular mobility patterns both temporally and across populations.
arXiv Detail & Related papers (2024-09-28T22:31:00Z) - Multi-OCT-SelfNet: Integrating Self-Supervised Learning with Multi-Source Data Fusion for Enhanced Multi-Class Retinal Disease Classification [2.5091334993691206]
Development of a robust deep-learning model for retinal disease diagnosis requires a substantial dataset for training.
The capacity to generalize effectively on smaller datasets remains a persistent challenge.
We've combined a wide range of data sources to improve performance and generalization to new data.
arXiv Detail & Related papers (2024-09-17T17:22:35Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances in self-supervised learning (SSL) to pre-train strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity
Recognition [34.424960016807795]
Multi-modal Human Activity Recognition can exploit complementary information across modalities to build models that generalize well.
While deep learning methods have shown promising results, their potential for extracting salient multi-modal spatial-temporal features has not been fully explored.
A knowledge distillation-based Multi-modal Mid-Fusion approach, DMFT, is proposed to conduct informative feature extraction and fusion to resolve the Multi-modal Human Activity Recognition task efficiently.
arXiv Detail & Related papers (2023-05-05T19:26:06Z) - Effective Adaptation in Multi-Task Co-Training for Unified Autonomous
Driving [103.745551954983]
In this paper, we investigate the transfer performance of various types of self-supervised methods, including MoCo and SimCLR, on three downstream tasks.
We find that their performance is sub-optimal or even lags far behind the single-task baseline.
We propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training.
arXiv Detail & Related papers (2022-09-19T12:15:31Z) - Contrastive Learning with Cross-Modal Knowledge Mining for Multimodal
Human Activity Recognition [1.869225486385596]
We explore the hypothesis that leveraging multiple modalities can lead to better recognition.
We extend a number of recent contrastive self-supervised approaches for the task of Human Activity Recognition.
We propose a flexible, general-purpose framework for performing multimodal self-supervised learning.
arXiv Detail & Related papers (2022-05-20T10:39:16Z) - Modality-specific Distillation [30.190082262375395]
We propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets.
Our idea aims at mimicking a teacher's modality-specific predictions by introducing an auxiliary loss term for each modality.
Because each modality has different importance for predictions, we also propose weighting approaches for the auxiliary losses (an illustrative sketch of this idea follows the list below).
arXiv Detail & Related papers (2021-01-06T05:45:07Z) - Task-Feature Collaborative Learning with Application to Personalized
Attribute Prediction [166.87111665908333]
We propose a novel multi-task learning method called Task-Feature Collaborative Learning (TFCL).
Specifically, we first propose a base model with a heterogeneous block-diagonal structure regularizer to leverage the collaborative grouping of features and tasks.
As a practical extension, we extend the base model by allowing overlapping features and differentiating the hard tasks.
arXiv Detail & Related papers (2020-04-29T02:32:04Z)
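For the modality-specific distillation (MSD) entry above: the idea of mimicking a teacher's per-modality predictions through weighted auxiliary loss terms can be illustrated with a small sketch. This is a hypothetical example assuming soft-label KL distillation and fixed, made-up per-modality weights; it is not the MSD paper's implementation.

```python
# Hypothetical sketch of modality-specific distillation: the student mimics the
# teacher's prediction for each modality through a weighted auxiliary KL term.
import torch
import torch.nn.functional as F

def modality_specific_distillation_loss(student_logits, teacher_logits_by_modality,
                                        weights, tau=2.0):
    """student_logits: (B, K); teacher_logits_by_modality: dict of (B, K) tensors."""
    loss = torch.zeros((), device=student_logits.device)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    for modality, t_logits in teacher_logits_by_modality.items():
        p_teacher = F.softmax(t_logits / tau, dim=-1)
        kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
        loss = loss + weights[modality] * kl  # one auxiliary term per modality
    return loss

# Illustrative usage with made-up modality names and weights:
# loss = modality_specific_distillation_loss(
#     student_out,
#     {"text": teacher_text_out, "image": teacher_image_out},
#     weights={"text": 0.4, "image": 0.6})
```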