F-ViTA: Foundation Model Guided Visible to Thermal Translation
- URL: http://arxiv.org/abs/2504.02801v1
- Date: Thu, 03 Apr 2025 17:47:06 GMT
- Title: F-ViTA: Foundation Model Guided Visible to Thermal Translation
- Authors: Jay N. Paranjape, Celso de Melo, Vishal M. Patel
- Abstract summary: We propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image.
- Score: 27.200043694866388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: https://github.com/JayParanjape/F-ViTA/tree/master.
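The conditioning scheme the abstract describes (zero-shot masks and labels from foundation models such as SAM and Grounded DINO fed to an InstructPix2Pix diffusion model) can be illustrated with a minimal sketch. The channel layout and helper names below are hypothetical illustrations, not the paper's actual implementation:

```python
import numpy as np

def build_conditioning(rgb, masks, label_ids):
    """Stack an RGB image with zero-shot segmentation masks into a single
    conditioning tensor for a mask-guided diffusion model.

    Hypothetical encoding: per-object binary masks are collapsed into one
    label map (each pixel holds the label id of the object covering it,
    0 = background), which is channel-concatenated with the RGB image.
    """
    h, w, _ = rgb.shape
    label_map = np.zeros((h, w), dtype=np.float32)
    for mask, lid in zip(masks, label_ids):
        label_map[mask.astype(bool)] = lid
    # 3 RGB channels + 1 label-map channel -> (h, w, 4) conditioning input.
    return np.concatenate([rgb.astype(np.float32), label_map[..., None]],
                          axis=-1)

# Toy example: a 4x4 image with two detected objects.
rgb = np.random.rand(4, 4, 3)
m1 = np.zeros((4, 4)); m1[:2, :2] = 1   # e.g. mask for a "car"
m2 = np.zeros((4, 4)); m2[2:, 2:] = 1   # e.g. mask for a "person"
cond = build_conditioning(rgb, [m1, m2], label_ids=[5, 9])
print(cond.shape)  # (4, 4, 4)
```

In a full pipeline, this conditioning tensor would be encoded and supplied to the diffusion model alongside the text labels, letting the model associate object categories with their thermal signatures.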
Related papers
- DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding [43.85632218045282]
We introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). The PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to transition from full-range to target-wavelength infrared. The VLUM incorporates unified vision-language understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images spanning various scenes and objects under diverse environmental conditions.
arXiv Detail & Related papers (2025-03-24T17:58:09Z)
- Multi-Domain Biometric Recognition using Body Embeddings [51.36007967653781]
We show that body embeddings perform better than face embeddings in the medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset. We also show that finetuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores.
arXiv Detail & Related papers (2025-03-13T22:38:18Z)
- TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation [25.542902579879367]
This paper proposes a novel diffusion method, dubbed Temporally Consistent Patch Diffusion Models (TC-PDM).
Our method faithfully preserves the semantic structure of generated visible images.
Experiments show that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible video translation and by 6.1% in AP50 for day-to-night object detection.
arXiv Detail & Related papers (2024-08-26T12:43:48Z)
- PID: Physics-Informed Diffusion Model for Infrared Image Generation [11.416759828137701]
Infrared imaging technology has gained significant attention for its reliable sensing ability in low visibility conditions.
Most existing image translation methods treat infrared images as a stylistic variation, neglecting the underlying physical laws.
We propose a Physics-Informed Diffusion (PID) model for translating RGB images to infrared images that adhere to physical laws.
arXiv Detail & Related papers (2024-07-12T14:32:30Z)
- IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection [55.554484379021524]
The Infrared Small Target Detection (IRSTD) task falls short of satisfactory performance due to a notable domain gap between natural and infrared images.
We propose the IRSAM model for IRSTD, which improves SAM's encoder-decoder architecture to learn better feature representation of infrared small objects.
arXiv Detail & Related papers (2024-07-10T10:17:57Z)
- Diff-Mosaic: Augmenting Realistic Representations in Infrared Small Target Detection via Diffusion Prior [63.64088590653005]
We propose Diff-Mosaic, a data augmentation method based on the diffusion model.
We introduce an enhancement network called Pixel-Prior, which generates highly coordinated and realistic Mosaic images.
In the second stage, we propose an image enhancement strategy named Diff-Prior. This strategy utilizes diffusion priors to model images in the real-world scene.
arXiv Detail & Related papers (2024-06-02T06:23:05Z)
- Taming Latent Diffusion Model for Neural Radiance Field Inpainting [63.297262813285265]
Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images.
We propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked training.
Our framework yields state-of-the-art NeRF inpainting results on various real-world scenes.
arXiv Detail & Related papers (2024-04-15T17:59:57Z)
- InfMAE: A Foundation Model in the Infrared Modality [38.23685358198649]
In this paper, we propose InfMAE, a foundation model in infrared modality.
We release an infrared dataset, called Inf30, to address the problem of lacking large-scale data for self-supervised learning.
We also design an information-aware masking strategy, which is suitable for infrared images.
arXiv Detail & Related papers (2024-02-01T08:02:10Z)
- LDM-ISP: Enhancing Neural ISP for Low Light with Latent Diffusion Models [54.93010869546011]
We propose to leverage the pre-trained latent diffusion model to perform the neural ISP for enhancing extremely low-light images.
Specifically, to tailor the pre-trained latent diffusion model to operate on the RAW domain, we train a set of lightweight taming modules.
We observe different roles of UNet denoising and decoder reconstruction in the latent diffusion model, which inspires us to decompose the low-light image enhancement task into latent-space low-frequency content generation and decoding-phase high-frequency detail maintenance.
arXiv Detail & Related papers (2023-12-02T04:31:51Z)
- Denoising Diffusion Models for Plug-and-Play Image Restoration [135.6359475784627]
This paper proposes DiffPIR, which integrates the traditional plug-and-play method into the diffusion sampling framework.
Compared to plug-and-play IR methods that rely on discriminative Gaussian denoisers, DiffPIR is expected to inherit the generative ability of diffusion models.
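As a toy illustration of the plug-and-play idea (alternating a denoiser prior step with a data-fidelity proximal step), here is a sketch on a 1-D denoising problem. The moving-average denoiser is a simple stand-in for DiffPIR's diffusion denoiser, and all parameter values are illustrative:

```python
import numpy as np

def toy_denoiser(x, strength=0.5):
    # Stand-in for a learned/diffusion denoiser: window-3 moving average,
    # blended with the input by `strength`.
    pad = np.pad(x, 1, mode="edge")
    smooth = (pad[:-2] + pad[1:-1] + pad[2:]) / 3.0
    return (1 - strength) * x + strength * smooth

def prox_data(z, y, sigma_n, rho):
    # Closed-form proximal step for the data term 1/(2*sigma_n^2)*||x - y||^2
    # with an identity forward operator: a weighted average of the noisy
    # measurement y and the prior estimate z.
    return (y / sigma_n**2 + rho * z) / (1.0 / sigma_n**2 + rho)

def plug_and_play_restore(y, sigma_n=0.1, rho=200.0, iters=20):
    # Alternate prior (denoiser) and data-fidelity steps, in the spirit of
    # plug-and-play / half-quadratic-splitting schemes.
    x = y.copy()
    for _ in range(iters):
        z = toy_denoiser(x)
        x = prox_data(z, y, sigma_n, rho)
    return x

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 3 * np.pi, 64))
y = clean + 0.1 * rng.standard_normal(64)
x_hat = plug_and_play_restore(y)
```

Swapping the hand-written smoother for a generative diffusion denoiser is exactly where DiffPIR gains its expressive prior over discriminative Gaussian denoisers.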
arXiv Detail & Related papers (2023-05-15T20:24:38Z)
- Exploring Thermal Images for Object Detection in Underexposure Regions for Autonomous Driving [67.69430435482127]
Underexposure regions are vital to construct a complete perception of the surroundings for safe autonomous driving.
The availability of thermal cameras has provided an essential alternative for exploring regions where other optical sensors fail to capture interpretable signals.
This work proposes a domain adaptation framework which employs a style transfer technique for transfer learning from visible spectrum images to thermal images.
arXiv Detail & Related papers (2020-06-01T09:59:09Z)
- Bayesian Fusion for Infrared and Visible Images [26.64101343489016]
In this paper, a novel Bayesian fusion model is established for infrared and visible images.
We aim to make the fused image satisfy the human visual system.
Compared with the previous methods, the novel model can generate better fused images with high-light targets and rich texture details.
arXiv Detail & Related papers (2020-05-12T14:57:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.