DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding
- URL: http://arxiv.org/abs/2503.19012v1
- Date: Mon, 24 Mar 2025 17:58:09 GMT
- Title: DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding
- Authors: Lingyan Ran, Lidong Wang, Guangcong Wang, Peng Wang, Yanning Zhang
- Abstract summary: We introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to transition infrared generation from the full wavelength range to the target wavelength. VLUM incorporates unified vision-language understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled from various scenes and objects under diverse environmental conditions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis challenge, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to transition infrared generation from the full wavelength range to the target wavelength. To improve V2IR translation, VLUM incorporates unified vision-language understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled from various scenes and objects under diverse environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves the performance of V2IR. Experiments validate DiffV2IR's excellence in producing high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at https://github.com/LidongWang-26/DiffV2IR.
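The abstract does not disclose DiffV2IR's internal architecture, so the following PyTorch sketch is only a hypothetical illustration of the stated design: a vision-language conditioning vector (standing in for VLUM) steering a noise-prediction diffusion backbone (standing in for PLM). All module names, dimensions, and the noise schedule are assumptions, not the authors' code.

```python
# Hypothetical sketch only: DiffV2IR's real modules are not given in the
# abstract; names, shapes, and the toy schedule below are assumptions.
import torch
import torch.nn as nn

class VLUMSketch(nn.Module):
    """Fuses an image embedding and a caption embedding into one
    conditioning vector, in the spirit of vision-language understanding."""
    def __init__(self, img_dim=512, txt_dim=512, cond_dim=256):
        super().__init__()
        self.proj = nn.Linear(img_dim + txt_dim, cond_dim)

    def forward(self, img_emb, txt_emb):
        return self.proj(torch.cat([img_emb, txt_emb], dim=-1))

class DenoiserSketch(nn.Module):
    """Stand-in diffusion backbone: predicts noise from a noisy IR latent,
    the timestep, and the conditioning vector."""
    def __init__(self, latent_dim=64, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim))

    def forward(self, z_t, t, cond):
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))

# One training step of the standard epsilon-prediction objective.
vlum, eps_model = VLUMSketch(), DenoiserSketch()
img_emb, txt_emb = torch.randn(4, 512), torch.randn(4, 512)
z0 = torch.randn(4, 64)                          # clean IR latent (toy)
t = torch.randint(0, 1000, (4,))
noise = torch.randn_like(z0)
alpha = 1.0 - t.float().unsqueeze(-1) / 1000.0   # toy linear schedule
z_t = alpha.sqrt() * z0 + (1 - alpha).sqrt() * noise
loss = ((eps_model(z_t, t, vlum(img_emb, txt_emb)) - noise) ** 2).mean()
```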
Related papers
- F-ViTA: Foundation Model Guided Visible to Thermal Translation
We propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation.
Our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image.
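The summary names two ingredients: frozen foundation-model knowledge and multi-band output (LWIR/MWIR/NIR) from the same visible input. Below is a loose sketch of that combination, assuming a frozen stand-in encoder and a learned band embedding; none of it reflects F-ViTA's actual design.

```python
# Sketch under assumptions: a frozen "foundation" encoder plus a learned
# wavelength-band embedding condition a translator head. All names are
# hypothetical, not F-ViTA's code.
import torch
import torch.nn as nn

BANDS = {"lwir": 0, "mwir": 1, "nir": 2}

class GuidedTranslator(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.foundation = nn.Conv2d(3, feat_dim, 3, padding=1)  # stand-in encoder
        for p in self.foundation.parameters():
            p.requires_grad = False            # frozen world knowledge
        self.band_emb = nn.Embedding(len(BANDS), feat_dim)
        self.head = nn.Conv2d(feat_dim, 1, 3, padding=1)  # 1-channel IR output

    def forward(self, rgb, band):
        f = self.foundation(rgb)
        b = self.band_emb(band)[:, :, None, None]  # broadcast over H, W
        return self.head(f + b)

model = GuidedTranslator()
rgb = torch.randn(2, 3, 64, 64)
ir = model(rgb, torch.tensor([BANDS["lwir"], BANDS["nir"]]))  # one RGB, two bands
```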
arXiv Detail & Related papers (2025-04-03T17:47:06Z)
- CoLLM: A Large Language Model for Composed Image Retrieval
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query.
We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs.
We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
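As a minimal sketch of the composed-retrieval query idea: fuse the reference-image embedding with the modification-text embedding and score the result against a gallery. The simple weighted sum below is an assumption; CoLLM's actual LLM-based fusion is more involved.

```python
# Generic composed-image-retrieval scoring, not CoLLM's architecture.
import torch
import torch.nn.functional as F

def compose_query(img_emb, txt_emb, w=0.5):
    q = w * img_emb + (1 - w) * txt_emb     # simple late fusion (assumption)
    return F.normalize(q, dim=-1)

gallery = F.normalize(torch.randn(1000, 512), dim=-1)   # candidate embeddings
query = compose_query(F.normalize(torch.randn(512), dim=-1),
                      F.normalize(torch.randn(512), dim=-1))
scores = gallery @ query                     # cosine similarity per candidate
topk = scores.topk(5).indices                # indices of the best matches
```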
arXiv Detail & Related papers (2025-03-25T17:59:50Z)
- Multi-Domain Biometric Recognition using Body Embeddings
We show that body embeddings perform better than face embeddings in medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset. We also show that fine-tuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores.
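The "simple combination of cross-entropy and triplet losses" mentioned above has a standard generic form; this sketch assumes a margin and weighting that the summary does not specify.

```python
# Generic CE + triplet combination; margin and lambda are assumptions.
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.3)

def combined_loss(logits, labels, anchor, positive, negative, lam=1.0):
    # identity-classification term + metric-learning term on body embeddings
    return ce(logits, labels) + lam * triplet(anchor, positive, negative)

logits = torch.randn(8, 100)                 # toy batch, 100 identities
labels = torch.randint(0, 100, (8,))
a, p, n = (torch.randn(8, 256) for _ in range(3))
loss = combined_loss(logits, labels, a, p, n)
```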
arXiv Detail & Related papers (2025-03-13T22:38:18Z)
- DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
DifIISR is an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. We introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity. We incorporate various visual foundation models as perceptual guidance for downstream visual tasks.
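Gradient guidance in diffusion sampling generally follows one pattern: differentiate an external guidance loss with respect to the current sample and nudge the denoised estimate along that gradient. The sketch below shows the pattern with placeholder losses and a toy denoiser, not DifIISR's actual guidance terms.

```python
# Generic gradient-guided reverse-diffusion step; all terms are placeholders.
import torch

def guidance_loss(x):
    return (x ** 2).mean()          # stand-in for a perceptual/task loss

def guided_step(x_t, denoise_fn, scale=0.1):
    x_t = x_t.detach().requires_grad_(True)
    x_prev = denoise_fn(x_t)                       # one reverse-diffusion step
    g = torch.autograd.grad(guidance_loss(x_prev), x_t)[0]
    return (x_prev - scale * g).detach()           # steer toward lower guidance loss

x = torch.randn(1, 1, 32, 32)
x = guided_step(x, lambda z: 0.9 * z)              # toy denoiser for illustration
```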
arXiv Detail & Related papers (2025-03-03T05:20:57Z)
- Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection
We propose a novel Hierarchical Multi-Modal Enhancement Network (HMMEN) that integrates RGB and IR data for robust and accurate TL detection. Our method introduces two key components: (1) a Mutual Multi-Modal Enhanced Block (MMEB), which fuses and enhances hierarchical RGB and IR feature maps in a coarse-to-fine manner, and (2) a Feature Alignment Block (FAB) that corrects misalignments between decoder outputs and IR feature maps by leveraging deformable convolutions.
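The FAB described above rests on deformable convolutions, which torchvision exposes as torchvision.ops.DeformConv2d. A rough sketch of the alignment idea, with all channel counts assumed and the paper's actual block design not reproduced:

```python
# Sketch of deformable-convolution feature alignment (assumed sizes).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignSketch(nn.Module):
    def __init__(self, ch=64, k=3):
        super().__init__()
        # predict (dx, dy) per kernel tap from the concatenated features
        self.offset = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.deform = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, dec_feat, ir_feat):
        off = self.offset(torch.cat([dec_feat, ir_feat], dim=1))
        return self.deform(ir_feat, off)   # IR features resampled into place

align = AlignSketch()
out = align(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```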
arXiv Detail & Related papers (2025-01-25T06:21:06Z)
- IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks
"IV-tuning" is a novel and general fine-tuning approach to parameter-efficiently harness PVMs for infrared-visible tasks.<n>At its core, IV-tuning freezes pre-trained visible-based PVMs and integrates infrared flow into modal prompts to interact with adapters.<n>By fine-tuning approximately 3% of the backbone parameters, IV-tuning outperforms full fine-tuning and previous state-of-the-art methods.
arXiv Detail & Related papers (2024-12-21T14:54:41Z)
- CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain
Infrared (IR) imaging offers advantages in several fields due to its unique ability to capture content in extreme light conditions.
As an alternative, visible light can be used to synthesize IR images, but this causes a loss of fidelity in image details and introduces inconsistencies due to a lack of contextual awareness of the scene.
arXiv Detail & Related papers (2024-11-25T12:23:14Z)
- Contourlet Refinement Gate Framework for Thermal Spectrum Distribution Regularized Infrared Image Super-Resolution
Image super-resolution (SR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts.
Current approaches to SR tasks either focus on extracting RGB image features or assume similar degradation patterns.
We propose a Contourlet refinement gate framework to restore infrared modal-specific features while preserving spectral distribution fidelity.
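A loose sketch of the refinement-gate idea, omitting the actual contourlet decomposition: a learned sigmoid gate decides how much high-frequency detail is injected back into the base reconstruction. All layer sizes here are assumptions.

```python
# Illustrative gated detail-injection block, not the paper's framework.
import torch
import torch.nn as nn

class RefinementGate(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())

    def forward(self, base, high_freq):
        # gate in [0, 1] modulates how much high-frequency detail passes through
        return base + self.gate(high_freq) * high_freq

g = RefinementGate()
out = g(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
```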
arXiv Detail & Related papers (2024-11-19T14:24:03Z)
- Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation
We propose a novel image-to-image translation framework, Pix2Next, to generate high-quality Near-Infrared (NIR) images from RGB inputs.
A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation.
The proposed approach enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.
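The multi-scale PatchGAN mentioned above generally means applying a patch-level critic to the image at several resolutions; the sketch below shows that pattern, with the layer configuration assumed rather than taken from Pix2Next.

```python
# Generic multi-scale PatchGAN discriminator; layer sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_critic(ch_in=1):
    # outputs a map of real/fake scores, one per receptive-field patch
    return nn.Sequential(
        nn.Conv2d(ch_in, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 1, 4, padding=1))

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, scales=3):
        super().__init__()
        self.critics = nn.ModuleList(patch_critic() for _ in range(scales))

    def forward(self, x):
        outs = []
        for critic in self.critics:
            outs.append(critic(x))
            x = F.avg_pool2d(x, 2)     # next critic sees a coarser view
        return outs

d = MultiScaleDiscriminator()
score_maps = d(torch.randn(1, 1, 128, 128))   # one patch-score map per scale
```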
arXiv Detail & Related papers (2024-09-25T07:51:47Z)
- Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commonalities underlying the two modalities and fuse in that common space via either iterative optimization or deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, then unrolls it into a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network.
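Unrolling a bilevel formulation of this kind typically yields joint updates in which the fusion network receives gradient feedback from the detector's loss. A very rough sketch with placeholder networks and losses, not TarDAL's:

```python
# Toy unrolled fusion-plus-detection update; every component is a placeholder.
import torch
import torch.nn as nn

fuse = nn.Conv2d(2, 1, 3, padding=1)     # fuses stacked IR + visible channels
detect = nn.Conv2d(1, 5, 1)              # stand-in detection head
opt = torch.optim.Adam(list(fuse.parameters()) + list(detect.parameters()), lr=1e-4)

ir, vis = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
fused = fuse(torch.cat([ir, vis], dim=1))
fusion_loss = (fused - ir).abs().mean() + (fused - vis).abs().mean()
det_loss = detect(fused).pow(2).mean()   # placeholder detection loss
(fusion_loss + det_loss).backward()      # fusion gets the detector's feedback
opt.step()
```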
arXiv Detail & Related papers (2022-03-30T11:44:56Z)
- I2V-GAN: Unpaired Infrared-to-Visible Video Translation
We propose an infrared-to-visible (I2V) video translation method, I2V-GAN, to generate visible-light videos from given unpaired infrared videos.
Our model capitalizes on three types of constraints: 1) an adversarial constraint to generate synthetic frames that are similar to the real ones, 2) cyclic consistency with the introduced perceptual loss for effective content conversion, and 3) similarity constraints across and within domains.
Experiments validate that I2V-GAN is superior to the compared SOTA methods in I2V video translation, producing higher fluency and finer semantic details.
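As a sketch only, the three constraint families listed above can be combined into one generator objective; the loss weights and the perceptual feature extractor below are assumptions, not I2V-GAN's actual configuration.

```python
# Generic adversarial + cycle/perceptual + similarity objective (assumed weights).
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def i2v_losses(d_fake_logits, real_ir, cycled_ir, feat_fn,
               fake_vis_feats, real_vis_feats):
    adv = bce(d_fake_logits, torch.ones_like(d_fake_logits))    # fool the critic
    cyc = (real_ir - cycled_ir).abs().mean()                    # cycle consistency
    perc = (feat_fn(real_ir) - feat_fn(cycled_ir)).abs().mean() # perceptual term
    sim = (fake_vis_feats - real_vis_feats).pow(2).mean()       # domain similarity
    return adv + 10.0 * cyc + perc + sim                        # weights assumed

feat = nn.Conv2d(1, 8, 3, padding=1)   # stand-in perceptual feature extractor
loss = i2v_losses(torch.randn(2, 1, 8, 8), torch.randn(2, 1, 64, 64),
                  torch.randn(2, 1, 64, 64), feat,
                  torch.randn(2, 8), torch.randn(2, 8))
```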
arXiv Detail & Related papers (2021-08-02T14:04:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.