Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition
- URL: http://arxiv.org/abs/2512.21064v1
- Date: Wed, 24 Dec 2025 09:10:04 GMT
- Title: Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition
- Authors: Hongsong Wang, Heng Fei, Bingxuan Dai, Jie Gui
- Abstract summary: We introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The proposed method strikes an excellent balance between computational cost and model performance.
- Score: 23.335246964537223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.
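To make the Decomposition and Composition strategies described in the abstract concrete, below is a minimal PyTorch-style sketch of how such a framework could be wired: a shared early-fusion backbone encodes the concatenated modalities, decomposition heads map the fused feature back to per-modality features aligned with unimodal targets, and a composed target built from the unimodal features guides the fused representation. All module names, the cosine-similarity losses, and the feature dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Decomposition-and-Composition idea from the abstract.
# Module names, loss terms, and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecompositionComposition(nn.Module):
    def __init__(self, num_modalities=3, dim=256):
        super().__init__()
        # Shared single-stream backbone over early-fused (concatenated) modalities.
        self.shared_encoder = nn.Sequential(
            nn.Linear(num_modalities * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Per-modality encoders providing "ground truth" unimodal targets.
        # In the full method these would be trained with their own objective;
        # here they only supply stop-gradient targets.
        self.unimodal_encoders = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_modalities)]
        )
        # Decomposition heads: split the fused feature back into unimodal parts.
        self.decompose_heads = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_modalities)]
        )
        # Composition head: integrate unimodal features into a guidance target.
        self.compose_head = nn.Linear(num_modalities * dim, dim)

    def forward(self, modalities):
        # modalities: list of tensors, each (batch, dim), e.g. joint/bone/motion.
        fused = self.shared_encoder(torch.cat(modalities, dim=-1))

        # Decomposition: fused feature -> per-modality features, aligned with
        # the corresponding unimodal targets (stop-gradient on the targets).
        decomp_loss = 0.0
        uni_feats = []
        for m, (head, enc) in enumerate(zip(self.decompose_heads, self.unimodal_encoders)):
            pred = head(fused)
            target = enc(modalities[m]).detach()
            uni_feats.append(target)
            decomp_loss = decomp_loss + (1 - F.cosine_similarity(pred, target).mean())

        # Composition: combined unimodal features guide the fused representation.
        composed = self.compose_head(torch.cat(uni_feats, dim=-1)).detach()
        comp_loss = 1 - F.cosine_similarity(fused, composed).mean()
        return fused, decomp_loss + comp_loss

# Usage: three skeleton modalities (e.g. joint, bone, motion) as 256-d features.
model = DecompositionComposition()
feats = [torch.randn(8, 256) for _ in range(3)]
fused, loss = model(feats)
loss.backward()
```

In practice each skeleton modality would come from a spatio-temporal backbone rather than an MLP; the sketch only illustrates how both self-supervised objectives can share a single early-fusion forward pass.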
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization.
We introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings.
Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
- Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models [6.350443894942629]
Multimodal Weight Allocation Module (MWAM) is a plug-and-play component that dynamically re-balances the contribution of each branch during training.
MWAM delivers consistent performance gains across a wide range of tasks and modality combinations.
arXiv Detail & Related papers (2026-02-26T05:51:41Z)
- Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process [55.91649771370862]
The Dirichlet process (DP) mixture model is a powerful non-parametric method that can amplify the most prominent features.
We propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment.
arXiv Detail & Related papers (2025-10-23T16:53:24Z)
- Mixup Helps Understanding Multimodal Video Better [12.281180208753021]
Multimodal models are prone to overfitting strong modalities, which can dominate learning and suppress the contributions of weaker ones.
We propose Multimodal Mixup (MM), which applies the Mixup strategy at the aggregated multimodal feature level to mitigate overfitting.
We also introduce Balanced Multimodal Mixup (B-MM), which dynamically adjusts the mixing ratios for each modality based on their relative contributions to the learning objective.
arXiv Detail & Related papers (2025-10-13T03:53:25Z)
- EPIC: Efficient Prompt Interaction for Text-Image Classification [12.278321407201458]
Large-scale pre-trained multimodal models (LMMs) have generally emerged to integrate the vision and language modalities.
We propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC).
Our method achieves reduced computational resource consumption and fewer trainable parameters compared to other fine-tuning strategies.
arXiv Detail & Related papers (2025-07-10T04:15:44Z)
- Asymmetric Reinforcing against Multi-modal Representation Bias [59.685072206359855]
We propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM).
Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information.
We have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
arXiv Detail & Related papers (2025-01-02T13:00:06Z)
- Balancing Multimodal Training Through Game-Theoretic Regularization [26.900302082724295]
Multimodal learning holds promise for richer information extraction by capturing dependencies across data sources.
Yet, current training methods often underperform due to modality competition.
This paper proposes the Multimodal Competition Regularizer (MCR), inspired by a mutual information (MI) decomposition.
arXiv Detail & Related papers (2024-11-11T19:53:05Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner (a sketch of this early-fusion idea follows this entry).
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
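For contrast with late-fusion pipelines, the sketch below illustrates the early-fusion, single-stream idea mentioned for UmURL: each modality is embedded, the embeddings are concatenated, and one shared backbone processes the fused feature in a single forward pass, so cost does not grow with per-modality backbones. The modality names, dimensions, and layer choices are assumptions for illustration only.

```python
# Hypothetical sketch of early-fusion, single-stream encoding (as opposed to
# running one backbone per modality and fusing late). Layer sizes and the
# choice of modalities are illustrative assumptions only.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, in_dims=(150, 150, 150), dim=256):
        super().__init__()
        # One lightweight embedding per modality (e.g. joint, bone, motion).
        self.embeds = nn.ModuleList([nn.Linear(d, dim) for d in in_dims])
        # A single shared backbone consumes the fused feature once per sample.
        self.backbone = nn.Sequential(
            nn.Linear(len(in_dims) * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, modalities):
        fused = torch.cat([emb(x) for emb, x in zip(self.embeds, modalities)], dim=-1)
        return self.backbone(fused)

encoder = EarlyFusionEncoder()
joint, bone, motion = (torch.randn(8, 150) for _ in range(3))
multimodal_feature = encoder([joint, bone, motion])  # shape: (8, 256)
```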
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.