Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
- URL: http://arxiv.org/abs/2510.23640v1
- Date: Fri, 24 Oct 2025 17:27:10 GMT
- Title: Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
- Authors: Zihao Jing, Yan Sun, Yan Yi Li, Sugitha Janarthanan, Alana Deng, Pingzhao Hu
- Abstract summary: Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges through two key strategies. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation.
- Score: 5.909755629383169
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.
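The Progressive Injection (PI) idea described in the abstract — asymmetrically feeding a fixed structural prior into the sequence stream, layer by layer, without updating the structure branch from the sequence side — can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the gating schedule, the `alpha0` value, and the `tanh` stand-in for the state-space layer update are all assumptions.

```python
import numpy as np

def progressive_injection(seq_states, struct_prior, num_layers=4, alpha0=0.1):
    """Hedged sketch of asymmetric, progressive fusion.

    seq_states:   (batch, d) sequence-stream hidden states
    struct_prior: (batch, d) fixed structural prior (2D+3D fused upstream)

    The prior is injected with a per-layer gate `alpha` that grows over
    depth, so early layers stay mostly modality-specific and later layers
    see more structural signal. The prior itself is never modified, which
    keeps the fusion asymmetric.
    """
    h = seq_states
    for layer in range(num_layers):
        alpha = alpha0 * (layer + 1) / num_layers   # progressively stronger gate
        h = (1 - alpha) * h + alpha * struct_prior  # gated additive injection
        h = np.tanh(h)                              # stand-in for the layer's own update
    return h
```

Because the injection weight starts small, the sequence stream is not overwhelmed by the prior in early layers — one plausible way to avoid the modality collapse that naive early fusion can cause.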
Related papers
- MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion [0.0]
MolFM-Lite is a multi-modal model that encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann priors over multiple RDKit-generated conformers; and (2) a cross-modal fusion layer where each modality can attend to others.
arXiv Detail & Related papers (2026-02-25T20:59:14Z)
- Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification [69.87877580725768]
Multimodal Visual Surrogate Compression (MVSC) learns to compress and adapt large 3D sMRI volumes into compact 2D features. MVSC has two key components: a Volume Context module that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner.
arXiv Detail & Related papers (2026-01-29T13:05:46Z)
- FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z)
- Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves improvements of 35.2% in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z)
- RMMSS: Towards Advanced Robust Multi-Modal Semantic Segmentation with Hybrid Prototype Distillation and Feature Selection [9.418241223504252]
We present RMMSS, a two-stage framework designed to enhance model robustness under missing-modality conditions. It comprises two key components: the Hybrid Prototype Distillation Module (HPDM) and the Feature Selection Module (FSM). Our experiments on three datasets demonstrate that our method improves missing-modality performance by 2.80%, 3.89%, and 0.89%, respectively.
arXiv Detail & Related papers (2025-05-19T08:46:03Z)
- Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling [90.23688195918432]
3D molecule generation is crucial for drug discovery and material science. Existing approaches typically maintain separate latent spaces for invariant and equivariant modalities. We propose UAE-3D, a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space.
arXiv Detail & Related papers (2025-03-19T08:56:13Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model [18.19558762805031]
This paper proposes the Coupled SSM model, which couples the state chains of multiple modalities while maintaining the independence of intra-modality state processes.
Experiments with multi-domain inputs on CMU-MOSEI, CH-SIMS, and CH-SIMSV2 verify the effectiveness of our model.
Results demonstrate that the Coupled Mamba model is capable of enhanced multi-modal fusion.
arXiv Detail & Related papers (2024-05-28T09:57:03Z)
- MolCRAFT: Structure-Based Drug Design in Continuous Parameter Space [31.53831043892904]
MolCRAFT is the first structure-based drug design model to operate in the continuous parameter space.
It consistently achieves superior performance in binding affinity with more stable 3D structure.
arXiv Detail & Related papers (2024-04-18T12:43:39Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
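The conformer ensemble attention described in the MolFM-Lite entry above — learnable attention combined with Boltzmann priors over multiple conformers — can be sketched as follows. This is a hedged illustration, not the paper's implementation: the input shapes, the relative-energy convention, and the thermal constant kT (about 0.593 kcal/mol at 298 K) are assumptions.

```python
import numpy as np

def conformer_ensemble_attention(conf_feats, energies, scores, kT=0.593):
    """Hedged sketch: Boltzmann-prior-weighted attention over conformers.

    conf_feats: (n_conf, d) per-conformer feature vectors
    energies:   (n_conf,) relative conformer energies (kcal/mol)
    scores:     (n_conf,) learnable attention logits

    Lower-energy conformers get a larger Boltzmann prior; the learned
    logits can sharpen or override that prior. The result is a single
    ensemble-level feature vector.
    """
    boltzmann = np.exp(-(energies - energies.min()) / kT)  # Boltzmann prior
    logits = scores + np.log(boltzmann)                    # combine prior with learned logits
    w = np.exp(logits - logits.max())
    w /= w.sum()                                           # softmax attention weights
    return w @ conf_feats                                  # weighted ensemble feature
```

Adding the log-prior to the logits before the softmax is equivalent to multiplying the attention weights by the Boltzmann factors and renormalizing — one natural way to combine a physical prior with learned attention.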
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.