MMOne: Representing Multiple Modalities in One Scene
- URL: http://arxiv.org/abs/2507.11129v2
- Date: Thu, 17 Jul 2025 09:45:51 GMT
- Title: MMOne: Representing Multiple Modalities in One Scene
- Authors: Zhifeng Gu, Bing Wang,
- Abstract summary: We propose a general framework, MMOne, to represent multiple modalities in one scene.<n>Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality.<n>We also design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences.
- Score: 2.617962830559083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.
Related papers
- Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z) - What to align in multimodal contrastive learning? [7.7439394183358745]
We introduce Contrastive MultiModal learning strategy that enables the communication between modalities in a single multimodal space.<n>Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation.<n>In the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on the seven multimodal benchmarks.
arXiv Detail & Related papers (2024-09-11T16:42:22Z) - Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities [17.723207830420996]
Multimodal learning methods often exhibit deteriorated performances if one or more modalities are missing.
We propose a robust textual-visual multimodal learning method, Chameleon, that completely deviates from the conventional multi-branch design.
Experiments are performed on four popular datasets including Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta.
arXiv Detail & Related papers (2024-07-23T07:29:57Z) - Learning Modality-agnostic Representation for Semantic Segmentation from Any Modalities [8.517830626176641]
Any2Seg is a novel framework that can achieve robust segmentation from any combination of modalities in any visual conditions.
Experiments on two benchmarks with four modalities demonstrate that Any2Seg achieves the state-of-the-art under the multi-modal setting.
arXiv Detail & Related papers (2024-07-16T03:34:38Z) - It is Never Too Late to Mend: Separate Learning for Multimedia Recommendation [19.826293335983145]
We propose Separate Learning (SEA) for multimedia recommendation, which mainly includes mutual information view of modal-unique and -generic learning.<n>Specifically, we first use GNN to learn the representations of users and items in different modalities and split each modal representation into generic and unique parts. Then, we design Solosimloss to maximize the lower bound of mutual information, to align the general parts of different modalities, thus learning more high-quality modal-generic features.
arXiv Detail & Related papers (2024-06-12T14:35:43Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE)
MMoE is able to be applied to various types of models to gain improvement.
arXiv Detail & Related papers (2023-11-16T05:31:21Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - High-Modality Multimodal Transformer: Quantifying Modality & Interaction
Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.