CM-Diff: A Single Generative Network for Bidirectional Cross-Modality Translation Diffusion Model Between Infrared and Visible Images
- URL: http://arxiv.org/abs/2503.09514v2
- Date: Thu, 07 Aug 2025 03:11:59 GMT
- Title: CM-Diff: A Single Generative Network for Bidirectional Cross-Modality Translation Diffusion Model Between Infrared and Visible Images
- Authors: Bin Hu, Chenqiang Gao, Shurui Liu, Junjie Guo, Fang Chen, Fangcen Liu, Junwei Han
- Abstract summary: We present the bidirectional cross-modality translation diffusion model (CM-Diff) for simultaneously modeling data distributions in both the infrared and visible modalities. Experimental results demonstrate the superiority of our CM-Diff over state-of-the-art methods.
- Score: 48.57429642590462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image translation is one of the key approaches for mitigating information deficiencies between the infrared and visible modalities, while also facilitating the enhancement of modality-specific datasets. However, existing methods for infrared and visible image translation either achieve only unidirectional modality translation or rely on cycle consistency for bidirectional modality translation, which may result in suboptimal performance. In this work, we present the bidirectional cross-modality translation diffusion model (CM-Diff), which simultaneously models the data distributions of both the infrared and visible modalities. We address this challenge by combining translation-direction labels for guidance during training with cross-modality feature control. Specifically, we view establishing the mapping between the two modalities as the process of learning the data distributions and understanding the modality differences, which is achieved through a novel Bidirectional Diffusion Training (BDT) strategy. In addition, we propose a Statistical Constraint Inference (SCI) strategy to ensure that the generated image closely adheres to the data distribution of the target modality. Experimental results demonstrate the superiority of CM-Diff over state-of-the-art methods, highlighting its potential for generating dual-modality datasets.
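To make the Bidirectional Diffusion Training idea concrete, the following is a minimal sketch of how a single denoising network could be trained on randomly sampled translation directions, with the direction label supplied as extra conditioning. It assumes a standard DDPM epsilon-prediction objective, a linear noise schedule, channel-wise concatenation of the source-modality image, and a placeholder `denoiser(x, t, direction)` network; these are illustrative assumptions, not the authors' implementation, and the paper's actual BDT and SCI procedures may differ.

```python
# Minimal sketch of direction-conditioned bidirectional diffusion training.
# Assumptions (not from the paper): standard DDPM epsilon-prediction loss,
# linear noise schedule, and a placeholder `denoiser(x, t, direction)` network
# that accepts a per-sample translation-direction label as conditioning.
import torch
import torch.nn.functional as F

T = 1000                                        # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def bdt_training_step(denoiser, ir, vis, optimizer):
    """One Bidirectional Diffusion Training step (illustrative only).

    For each sample a translation direction is drawn at random:
      0: infrared -> visible (denoise a noisy visible image given the IR image)
      1: visible -> infrared (denoise a noisy infrared image given the VIS image)
    The same network handles both directions; the direction label and the
    source-modality image steer it toward the correct target modality.
    Assumes `ir` and `vis` share the same shape (b, c, h, w).
    """
    b = ir.shape[0]
    direction = torch.randint(0, 2, (b,), device=ir.device)       # per-sample direction label
    is_ir2vis = (direction == 0).view(b, 1, 1, 1)
    src = torch.where(is_ir2vis, ir, vis)                         # conditioning modality
    tgt = torch.where(is_ir2vis, vis, ir)                         # modality being generated

    t = torch.randint(0, T, (b,), device=ir.device)               # random timestep per sample
    noise = torch.randn_like(tgt)
    a_bar = alphas_cumprod.to(ir.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * tgt + (1.0 - a_bar).sqrt() * noise       # forward diffusion q(x_t | x_0)

    pred_noise = denoiser(torch.cat([x_t, src], dim=1), t, direction)
    loss = F.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the desired direction label would simply be fixed (e.g. 0 for infrared-to-visible) while running the reverse diffusion process; the paper's SCI step additionally constrains samples toward the target modality's statistics, which is not reproduced here.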
Related papers
- Dual-branch Prompting for Multimodal Machine Translation [9.903997553625253]
We propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model. Experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-07-23T15:22:51Z) - Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation [54.96563068182733]
We propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task.
MADM utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities.
We show that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities.
arXiv Detail & Related papers (2024-10-29T03:49:40Z) - Unsupervised Visible-Infrared ReID via Pseudo-label Correction and Modality-level Alignment [23.310509459311046]
Unsupervised visible-infrared person re-identification (UVI-ReID) has recently gained great attention due to its potential for enhancing human detection in diverse environments without labeling.
Previous methods utilize intra-modality clustering and cross-modality feature matching to achieve UVI-ReID.
arXiv Detail & Related papers (2024-04-10T02:03:14Z) - Diffusion based Zero-shot Medical Image-to-Image Translation for Cross Modality Segmentation [18.895926089773177]
Cross-modality image segmentation aims to segment the target modalities using a method designed in the source modality.
Deep generative models can translate the target modality images into the source modality, thus enabling cross-modality segmentation.
arXiv Detail & Related papers (2024-04-01T13:23:04Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - Zero-shot-Learning Cross-Modality Data Translation Through Mutual Information Guided Stochastic Diffusion [5.795193288204816]
Cross-modality data translation has attracted great interest in image computing.
This paper proposes a new unsupervised zero-shot-learning method named Mutual Information Diffusion guided cross-modality data translation Model (MIDiffusion).
We empirically show the advanced performance of MIDiffusion in comparison with an influential group of generative models.
arXiv Detail & Related papers (2023-01-31T16:24:34Z) - Unsupervised Medical Image Translation with Adversarial Diffusion Models [0.2770822269241974]
Imputation of missing images via source-to-target modality translation can improve diversity in medical imaging protocols.
Here, we propose a novel method based on adversarial diffusion modeling, SynDiff, for improved performance in medical image translation.
arXiv Detail & Related papers (2022-07-17T15:53:24Z) - Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z) - Dual Diffusion Implicit Bridges for Image-to-Image Translation [104.59371476415566]
Common image-to-image translation methods rely on joint training over data from both source and target domains.
We present Dual Diffusion Implicit Bridges (DDIBs), an image translation method based on diffusion models.
DDIBs allow translations between arbitrary pairs of source-target domains, given independently trained diffusion models on the respective domains (see the sketch after this list).
arXiv Detail & Related papers (2022-03-16T04:10:45Z) - Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z) - Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust Road Extraction [110.61383502442598]
We introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet).
CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement.
Experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction.
arXiv Detail & Related papers (2021-11-30T04:30:10Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
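As a point of contrast with CM-Diff's single bidirectional network, the sketch below illustrates the two-stage, two-model translation described in the DDIB entry above: the source-domain model's deterministic DDIM ODE encodes the input into a latent, and an independently trained target-domain model decodes it. The `ddim_step` helper, model signatures, and timestep grid are assumptions for illustration, not the original DDIB code.

```python
# Minimal sketch of DDIB-style translation: chain two independently trained
# diffusion models through a shared latent. The model signature
# `model(x, t) -> predicted_noise` and the timestep grid are assumptions.
import torch

@torch.no_grad()
def ddim_step(model, x, t_cur, t_next, alphas_cumprod):
    """One deterministic DDIM step (eta = 0) from timestep t_cur to t_next."""
    a_cur = alphas_cumprod[t_cur]
    a_next = alphas_cumprod[t_next]
    eps = model(x, torch.full((x.shape[0],), t_cur, device=x.device))
    x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()   # predicted clean image
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

@torch.no_grad()
def ddib_translate(src_model, tgt_model, x_src, alphas_cumprod, timesteps):
    """Translate x_src to the target domain via a shared noise-space latent.

    `timesteps` is an increasing list of integer steps,
    e.g. timesteps = list(range(0, 1000, 20)) + [999].
    """
    # 1) Encode: run the source model's ODE forward in t (DDIM inversion).
    x = x_src
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_step(src_model, x, t_cur, t_next, alphas_cumprod)
    # 2) Decode: run the target model's ODE back down to t = 0.
    for t_cur, t_next in zip(reversed(timesteps[1:]), reversed(timesteps[:-1])):
        x = ddim_step(tgt_model, x, t_cur, t_next, alphas_cumprod)
    return x
```

Unlike this two-model pipeline, CM-Diff trains one network for both directions, which is what the direction-label conditioning sketched earlier is meant to illustrate.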