Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
- URL: http://arxiv.org/abs/2510.23640v1
- Date: Fri, 24 Oct 2025 17:27:10 GMT
- Title: Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
- Authors: Zihao Jing, Yan Sun, Yan Yi Li, Sugitha Janarthanan, Alana Deng, Pingzhao Hu
- Abstract summary: Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges through two key strategies. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation.
- Score: 5.909755629383169
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.
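The Progressive Injection (PI) idea described in the abstract — asymmetrically feeding a fixed structural prior into the sequence stream, layer by layer, without updating the structure branch from the sequence side — can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the gating schedule, the `alpha0` value, and the `tanh` stand-in for the state-space layer update are all assumptions.

```python
import numpy as np

def progressive_injection(seq_states, struct_prior, num_layers=4, alpha0=0.1):
    """Hedged sketch of asymmetric, progressive fusion.

    seq_states:   (batch, d) sequence-stream hidden states
    struct_prior: (batch, d) fixed structural prior (2D+3D fused upstream)

    The prior is injected with a per-layer gate `alpha` that grows over
    depth, so early layers stay mostly modality-specific and later layers
    see more structural signal. The prior itself is never modified, which
    keeps the fusion asymmetric.
    """
    h = seq_states
    for layer in range(num_layers):
        alpha = alpha0 * (layer + 1) / num_layers   # progressively stronger gate
        h = (1 - alpha) * h + alpha * struct_prior  # gated additive injection
        h = np.tanh(h)                              # stand-in for the layer's own update
    return h
```

Because the injection weight starts small, the sequence stream is not overwhelmed by the prior in early layers — one plausible way to avoid the modality collapse that naive early fusion can cause.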
Related papers
- MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion [0.0]
MolFM-Lite is a multi-modal model that encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann priors over multiple RDKit-generated conformers; and (2) a cross-modal fusion layer where each modality can attend to others.
arXiv Detail & Related papers (2026-02-25T20:59:14Z)
- Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification [69.87877580725768]
Multimodal Visual Surrogate Compression (MVSC) learns to compress and adapt large 3D sMRI volumes into compact 2D features. MVSC has two key components: a Volume Context module that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner.
arXiv Detail & Related papers (2026-01-29T13:05:46Z)
- FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z)
- Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves improvements of 35.2% in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z)
- RMMSS: Towards Advanced Robust Multi-Modal Semantic Segmentation with Hybrid Prototype Distillation and Feature Selection [9.418241223504252]
We present RMMSS, a two-stage framework designed to enhance model robustness under missing-modality conditions. It comprises two key components: the Hybrid Prototype Distillation Module (HPDM) and the Feature Selection Module (FSM). Our experiments on three datasets demonstrate that our method improves missing-modality performance by 2.80%, 3.89%, and 0.89%, respectively.
arXiv Detail & Related papers (2025-05-19T08:46:03Z)
- Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling [90.23688195918432]
3D molecule generation is crucial for drug discovery and material science. Existing approaches typically maintain separate latent spaces for invariant and equivariant modalities. We propose UAE-3D, a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space.
arXiv Detail & Related papers (2025-03-19T08:56:13Z)
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
- Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model [18.19558762805031]
This paper proposes the Coupled SSM model, which couples the state chains of multiple modalities while maintaining the independence of intra-modality state processes.
Experiments with multi-domain inputs on CMU-MOSEI, CH-SIMS, and CH-SIMSV2 verify the effectiveness of our model.
Results demonstrate that the Coupled Mamba model is capable of enhanced multi-modal fusion.
arXiv Detail & Related papers (2024-05-28T09:57:03Z)
- MolCRAFT: Structure-Based Drug Design in Continuous Parameter Space [31.53831043892904]
MolCRAFT is the first structure-based drug design model to operate in the continuous parameter space.
It consistently achieves superior performance in binding affinity with more stable 3D structure.
arXiv Detail & Related papers (2024-04-18T12:43:39Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
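The conformer ensemble attention described in the MolFM-Lite entry above — learnable attention combined with Boltzmann priors over multiple conformers — can be sketched as follows. This is a hedged illustration, not the paper's implementation: the input shapes, the relative-energy convention, and the thermal constant kT (about 0.593 kcal/mol at 298 K) are assumptions.

```python
import numpy as np

def conformer_ensemble_attention(conf_feats, energies, scores, kT=0.593):
    """Hedged sketch: Boltzmann-prior-weighted attention over conformers.

    conf_feats: (n_conf, d) per-conformer feature vectors
    energies:   (n_conf,) relative conformer energies (kcal/mol)
    scores:     (n_conf,) learnable attention logits

    Lower-energy conformers get a larger Boltzmann prior; the learned
    logits can sharpen or override that prior. The result is a single
    ensemble-level feature vector.
    """
    boltzmann = np.exp(-(energies - energies.min()) / kT)  # Boltzmann prior
    logits = scores + np.log(boltzmann)                    # combine prior with learned logits
    w = np.exp(logits - logits.max())
    w /= w.sum()                                           # softmax attention weights
    return w @ conf_feats                                  # weighted ensemble feature
```

Adding the log-prior to the logits before the softmax is equivalent to multiplying the attention weights by the Boltzmann factors and renormalizing — one natural way to combine a physical prior with learned attention.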
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.