InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions
- URL: http://arxiv.org/abs/2509.25270v2
- Date: Sat, 04 Oct 2025 06:25:29 GMT
- Title: InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions
- Authors: Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan
- Abstract summary: In multimodal representation learning, synergistic interactions between modalities provide complementary information and create unique outcomes. Existing methods may struggle to capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. We introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy.
- Score: 66.45467539731288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
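To make the mechanism concrete, below is a minimal PyTorch-style sketch of the masking-and-alignment idea described in the abstract. Everything here is an illustrative assumption rather than the released method: the function names (`random_feature_mask`, `infmasking_style_loss`), the keep ratio, the plain InfoNCE objective, and the averaging over a few sampled masks (a stand-in for the paper's derived InfMasking loss, which approximates the infinite-mask limit). The authoritative implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def random_feature_mask(x, keep_ratio=0.25):
    # Stochastically zero out most feature dimensions of one modality,
    # keeping only a random subset (hypothetical masking scheme).
    mask = (torch.rand_like(x) < keep_ratio).float()
    return x * mask

def info_nce(z_a, z_b, temperature=0.1):
    # InfoNCE: a standard lower bound used for mutual information maximization.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def infmasking_style_loss(feats_a, feats_b, fuse, n_masks=4, keep_ratio=0.25):
    # Align the unmasked fused representation with several masked fusions.
    # `fuse` is any module mapping two modality features to one vector.
    # Averaging over a few sampled masks is a crude stand-in for the
    # paper's closed-form approximation to infinitely many masks.
    z_full = fuse(feats_a, feats_b)
    loss = 0.0
    for _ in range(n_masks):
        z_masked = fuse(random_feature_mask(feats_a, keep_ratio),
                        random_feature_mask(feats_b, keep_ratio))
        loss = loss + info_nce(z_full, z_masked)
    return loss / n_masks
```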
Related papers
- Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations [4.67724003380452]
Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce COrAL, a principled framework.
arXiv Detail & Related papers (2026-02-16T18:06:53Z) - Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation [33.45651294176388]
MDiTFace is a customized diffusion transformer framework that employs a unified tokenization strategy to process semantic mask and text inputs. Extensive experiments demonstrate that MDiTFace significantly outperforms other competing methods in terms of both facial fidelity and conditional consistency.
arXiv Detail & Related papers (2025-11-16T14:52:54Z) - Refining Contrastive Learning and Homography Relations for Multi-Modal Recommendation [19.01114538768217]
We propose a novel framework for REfining multi-modAl contRastive learning and hoMography relations (REARM). Our experiments on three real-world datasets demonstrate the superiority of REARM to various state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-19T11:35:48Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - ColorMAE: Exploring data-independent masking strategies in Masked AutoEncoders [53.3185750528969]
Masked AutoEncoders (MAE) have emerged as a robust self-supervised framework.
We introduce a data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise.
We demonstrate our strategy's superiority in downstream tasks compared to random masking.
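As a rough illustration of data-independent masking, the sketch below thresholds low-pass-filtered random noise into a binary patch mask. The box-blur filter, the grid shape, and the mask ratio are assumptions made for illustration only; ColorMAE's actual noise "color" filters and implementation details differ.

```python
import torch
import torch.nn.functional as F

def filtered_noise_mask(h, w, mask_ratio=0.75, kernel_size=9):
    # Low-pass filter random noise (a box blur stands in for one of
    # ColorMAE's noise "color" filters), then keep the highest values
    # so that roughly (1 - mask_ratio) of the patches stay visible.
    noise = torch.rand(1, 1, h, w)
    kernel = torch.ones(1, 1, kernel_size, kernel_size) / kernel_size ** 2
    smooth = F.conv2d(noise, kernel, padding=kernel_size // 2).flatten()
    k = max(1, int(mask_ratio * smooth.numel()))
    thresh = smooth.kthvalue(k).values
    return (smooth > thresh).float().view(h, w)  # 1 = visible patch
```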
arXiv Detail & Related papers (2024-07-17T22:04:00Z) - Beyond Random Missingness: Clinically Rethinking for Healthcare Time Series Imputation [7.21960656196858]
This study investigates the impact of masking strategies on time series imputation models in healthcare settings. Using the PhysioNet Challenge 2012 dataset, we analyse how different masking implementations affect both imputation accuracy and downstream clinical predictions.
arXiv Detail & Related papers (2024-05-26T18:05:12Z) - Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation [51.80447197290866]
Multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given knowledge graphs. Existing MMKGC methods usually extract multi-modal features with pre-trained models. We introduce a novel framework, MyGO, to tokenize, fuse, and augment the fine-grained multi-modal representations of entities.
arXiv Detail & Related papers (2024-04-15T05:40:41Z) - Unity by Diversity: Improved Representation Learning in Multimodal VAEs [24.85691124169784]
We show that a better latent representation can be obtained by replacing hard constraints with a soft constraint. We show improved learned latent representations and imputation of missing data modalities compared to existing methods.
arXiv Detail & Related papers (2024-03-08T13:29:46Z) - MultiFIX: An XAI-friendly feature inducing approach to building models from multimodal data [0.0]
MultiFIX is a new interpretability-focused multimodal data fusion pipeline.
An end-to-end deep learning architecture is used to train a predictive model.
We apply MultiFIX to a publicly available dataset for the detection of malignant skin lesions.
arXiv Detail & Related papers (2024-02-19T14:45:46Z) - MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to yield improvements.
arXiv Detail & Related papers (2023-11-16T05:31:21Z) - Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement Learning [53.00683059396803]
Masked image modeling (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as annotations.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Contextual Fusion For Adversarial Robustness [0.0]
Deep neural networks are usually designed to process one particular information stream and are susceptible to various types of adversarial perturbations.
We developed a fusion model using a combination of background and foreground features extracted in parallel from Places-CNN and Imagenet-CNN.
For gradient based attacks, our results show that fusion allows for significant improvements in classification without decreasing performance on unperturbed data.
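The summary does not specify the fusion rule, so the sketch below assumes the simplest option: concatenating the two parallel feature streams before a linear classifier. The class name and feature dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ParallelFeatureFusion(nn.Module):
    # Illustrative late fusion of background (scene) and foreground
    # (object) features, e.g. from Places-CNN and Imagenet-CNN backbones.
    def __init__(self, scene_dim=512, object_dim=512, n_classes=10):
        super().__init__()
        self.classifier = nn.Linear(scene_dim + object_dim, n_classes)

    def forward(self, scene_feats, object_feats):
        fused = torch.cat([scene_feats, object_feats], dim=-1)
        return self.classifier(fused)
```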
arXiv Detail & Related papers (2020-11-18T20:13:23Z) - Panoptic Feature Fusion Net: A Novel Instance Segmentation Paradigm for Biomedical and Biological Images [91.41909587856104]
In this work, we present a Panoptic Feature Fusion Net (PFFNet) that unifies semantic and instance features.
Our proposed PFFNet contains a residual attention feature fusion mechanism to incorporate the instance prediction with the semantic features.
It outperforms several state-of-the-art methods on various biomedical and biological datasets.
arXiv Detail & Related papers (2020-02-15T09:19:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.