Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding
- URL: http://arxiv.org/abs/2311.03106v1
- Date: Mon, 6 Nov 2023 13:56:57 GMT
- Title: Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding
- Authors: Shengkai Sun, Daizong Liu, Jianfeng Dong, Xiaoye Qu, Junyu Gao, Xun
Yang, Xun Wang, Meng Wang
- Abstract summary: Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multi-modal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
- Score: 62.70450216120704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised pre-training has recently shown great success in skeleton-based
action understanding. Existing works typically train separate modality-specific models
and then integrate the multi-modal information for action understanding with a
late-fusion strategy. Although these approaches achieve strong performance, they suffer
from complex yet redundant multi-stream model designs, each of which is also limited to
a fixed input skeleton modality. To alleviate these issues, in this paper, we propose a
Unified Multi-modal Unsupervised Representation Learning framework, called UmURL, which
exploits an efficient early-fusion strategy to jointly encode the multi-modal features
in a single-stream manner. Specifically, instead of designing separate modality-specific
optimization processes for uni-modal unsupervised learning, we feed different modality
inputs into the same stream with an early-fusion strategy to learn their multi-modal
features, thereby reducing model complexity. To ensure that the fused multi-modal
features exhibit no modality bias, i.e., are not dominated by a particular input
modality, we further propose intra- and inter-modal consistency learning to guarantee
that the multi-modal features contain the complete semantics of each modality via
feature decomposition and distinct alignment. In this manner, our framework learns
unified representations from uni-modal or multi-modal skeleton input, remaining flexible
to different kinds of modality input for robust action understanding in practical
settings. Extensive experiments on three large-scale datasets, i.e., NTU-60, NTU-120,
and PKU-MMD II, demonstrate that UmURL is highly efficient, with approximately the same
complexity as uni-modal methods, while achieving new state-of-the-art performance across
various downstream task scenarios in skeleton-based action representation learning.
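To make the early-fusion, single-stream design and the intra-/inter-modal consistency objectives above concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the modality names (joint, bone, motion), the transformer backbone, the per-modality projection heads, and the InfoNCE/cosine consistency losses are assumptions made only to illustrate the ideas described in the abstract.

```python
# Minimal sketch (not the authors' code) of an early-fusion, single-stream
# multi-modal skeleton encoder with intra-/inter-modal consistency losses.
# Modality names, dimensions, and loss choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedSkeletonEncoder(nn.Module):
    def __init__(self, in_dims, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # One lightweight embedding per modality, then a single shared stream.
        self.embed = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in in_dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Per-modality projection heads used to decompose the fused feature.
        self.heads = nn.ModuleDict({m: nn.Linear(d_model, d_model) for m in in_dims})

    def forward(self, inputs):
        # inputs: dict of modality -> (B, T, in_dim); missing modalities are simply
        # skipped, which keeps the single stream flexible to any modality combination.
        tokens = [self.embed[m](x) for m, x in inputs.items()]
        fused = self.backbone(torch.cat(tokens, dim=1)).mean(dim=1)   # (B, d_model)
        parts = {m: self.heads[m](fused) for m in inputs}             # decomposed per-modality features
        return fused, parts


def consistency_losses(parts, targets, tau=0.07):
    """Intra-modal: each decomposed feature aligns with a uni-modal target embedding.
    Inter-modal: decomposed features of the same sample agree with each other.
    `targets` would come from, e.g., separately encoded single-modality views."""
    intra, inter = 0.0, 0.0
    mods = list(parts)
    for m in mods:
        z, t = F.normalize(parts[m], dim=-1), F.normalize(targets[m], dim=-1)
        logits = z @ t.t() / tau                                      # InfoNCE-style alignment
        intra = intra + F.cross_entropy(logits, torch.arange(z.size(0), device=z.device))
    for i in range(len(mods)):
        for j in range(i + 1, len(mods)):
            inter = inter + (1 - F.cosine_similarity(parts[mods[i]], parts[mods[j]]).mean())
    return intra, inter


if __name__ == "__main__":
    # Joint/bone/motion are common skeleton modalities (assumed here, e.g., 25 joints x 3 coords).
    dims = {"joint": 75, "bone": 75, "motion": 75}
    model = UnifiedSkeletonEncoder(dims)
    batch = {m: torch.randn(8, 64, d) for m, d in dims.items()}       # (B, T, in_dim)
    fused, parts = model(batch)
    targets = {m: torch.randn(8, 256) for m in dims}                  # placeholder uni-modal targets
    intra, inter = consistency_losses(parts, targets)
    print(fused.shape, (intra + inter).item())
```

The key point the sketch illustrates is that all modalities share one backbone (early fusion), while per-modality heads decompose the fused feature so that consistency losses can check it still carries each modality's semantics, which is how modality bias would be discouraged in such a setup.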
Related papers
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances from multiple modalities when only a few labeled examples are available.
We propose a Generative Transfer Learning framework consisting of two stages: the first involves training on abundant unimodal data, and the second focuses on transfer learning to adapt to novel data.
Our findings demonstrate that GTL outperforms state-of-the-art methods across four distinct multi-modal datasets.
arXiv Detail & Related papers (2024-10-14T16:09:38Z)
- Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations [16.036997801745905]
Multimodal learning plays a crucial role in enabling machine learning models to fuse and utilize diverse data sources.
Recent binding methods, such as ImageBind, typically use a fixed anchor modality to align multimodal data in the anchor modality's embedding space.
We propose CentroBind, a simple yet powerful approach that eliminates the need for a fixed anchor.
arXiv Detail & Related papers (2024-10-02T23:19:23Z)
- Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models [6.610033827647869]
In real-world scenarios, consistently acquiring complete multimodal data presents significant challenges.
This often leads to the issue of missing modalities, where data for certain modalities are absent.
We propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method.
arXiv Detail & Related papers (2024-07-17T14:44:25Z)
- Asynchronous Multimodal Video Sequence Fusion via Learning Modality-Exclusive and -Agnostic Representations [19.731611716111566]
We propose a Multimodal fusion approach for learning modality-Exclusive and modality-Agnostic representations.
We introduce a predictive self-attention module to capture reliable context dynamics within modalities.
A hierarchical cross-modal attention module is designed to explore valuable element correlations among modalities.
A double-discriminator strategy is presented to ensure the production of distinct representations in an adversarial manner.
arXiv Detail & Related papers (2024-07-06T04:36:48Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)