TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
- URL: http://arxiv.org/abs/2602.19430v2
- Date: Tue, 24 Feb 2026 03:51:01 GMT
- Title: TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation
- Authors: Dong-Guw Lee, Tai Hyoung Rhee, Hyunsoo Jang, Young-Sik Shin, Ukcheol Shin, Ayoung Kim
- Abstract summary: TherA is a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both the scene and object levels. TherA achieves state-of-the-art translation performance, improving zero-shot translation by up to 33% averaged across all metrics.
- Score: 12.591408054941027
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite the inherent advantages of thermal infrared (TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo TIR data via image translation; however, most RGB-to-TIR approaches rely heavily on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both the scene and object levels. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes the scene, object, material, and heat-emission context of the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to existing baselines, TherA achieves state-of-the-art translation performance, improving zero-shot translation by up to 33% averaged across all metrics.
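As a rough illustration of the pipeline sketched in the abstract, the following PyTorch fragment conditions a toy epsilon-prediction diffusion step on an embedding fused from an RGB image and prompted condition tokens. Every name, layer size, and the cosine schedule below are illustrative assumptions; the paper's actual TherA-VLM and latent-diffusion translator are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThermalAwareConditioner(nn.Module):
    """Toy stand-in for TherA-VLM: fuses an RGB image with prompted condition
    tokens (time of day, weather, object state) into one conditioning embedding."""
    def __init__(self, embed_dim: int = 256, vocab: int = 16):
        super().__init__()
        self.image_proj = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.cond_embed = nn.Embedding(vocab, embed_dim)

    def forward(self, rgb: torch.Tensor, cond_ids: torch.Tensor) -> torch.Tensor:
        # Sum the image feature with the mean of the condition-token embeddings.
        return self.image_proj(rgb) + self.cond_embed(cond_ids).mean(dim=1)

class ConditionedDenoiser(nn.Module):
    """Tiny epsilon-prediction network; a real translator would be a UNet
    with cross-attention on the thermal-aware embedding."""
    def __init__(self, latent_dim: int = 64, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, t, cond):
        return self.net(torch.cat([z_t, cond, t[:, None].float()], dim=-1))

def diffusion_training_step(denoiser, conditioner, rgb, cond_ids, tir_latent, T=1000):
    """One DDPM-style training step on TIR latents, conditioned on the embedding."""
    cond = conditioner(rgb, cond_ids)
    t = torch.randint(0, T, (tir_latent.size(0),))
    noise = torch.randn_like(tir_latent)
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2   # cosine schedule
    z_t = alpha_bar.sqrt()[:, None] * tir_latent + (1 - alpha_bar).sqrt()[:, None] * noise
    return F.mse_loss(denoiser(z_t, t, cond), noise)

# Random tensors stand in for an RGB image, its TIR latent, and two prompted
# condition tokens (e.g. "night", "rain").
conditioner, denoiser = ThermalAwareConditioner(), ConditionedDenoiser()
loss = diffusion_training_step(denoiser, conditioner,
                               rgb=torch.randn(4, 3, 32, 32),
                               cond_ids=torch.randint(0, 16, (4, 2)),
                               tir_latent=torch.randn(4, 64))
loss.backward()
```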
Related papers
- RAW-Flow: Advancing RGB-to-RAW Image Reconstruction with Deterministic Latent Flow Matching [55.03149221192589]
We introduce a novel framework named RAW-Flow to bridge the gap between RGB and RAW representations. We also introduce a cross-scale context guidance module that injects hierarchical RGB features into the flow estimation process. RAW-Flow outperforms state-of-the-art approaches both quantitatively and visually.
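The "deterministic latent flow matching" mentioned above can be sketched as regressing a velocity field along a straight-line path from the RGB latent to the RAW latent. A minimal PyTorch sketch follows, with hypothetical latent shapes and without RAW-Flow's cross-scale context guidance module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VelocityField(nn.Module):
    """Predicts the flow velocity at latent z_t and time t."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, z_t, t):
        return self.net(torch.cat([z_t, t[:, None]], dim=-1))

def flow_matching_loss(v_theta, z_rgb, z_raw):
    # Deterministic straight-line path between the two latents; the
    # regression target is the constant velocity z_raw - z_rgb.
    t = torch.rand(z_rgb.size(0))
    z_t = (1 - t[:, None]) * z_rgb + t[:, None] * z_raw
    return F.mse_loss(v_theta(z_t, t), z_raw - z_rgb)

v = VelocityField()
loss = flow_matching_loss(v, torch.randn(8, 64), torch.randn(8, 64))
loss.backward()
```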
arXiv Detail & Related papers (2026-01-28T08:27:38Z)
- ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation [14.108149959967095]
Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, but such paired data is scarce. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution. We propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation.
arXiv Detail & Related papers (2025-09-29T14:55:51Z)
- Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection [67.02804741856512]
We propose a novel Hierarchical Multi-Modal Enhancement Network (HMMEN) that integrates RGB and IR data for robust and accurate TL detection. Our method introduces two key components: (1) a Mutual Multi-Modal Enhanced Block (MMEB), which fuses and enhances hierarchical RGB and IR feature maps in a coarse-to-fine manner, and (2) a Feature Alignment Block (FAB) that corrects misalignments between decoder outputs and IR feature maps by leveraging deformable convolutions.
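The FAB idea of correcting misalignment with deformable convolutions can be illustrated with torchvision's DeformConv2d: offsets predicted from the concatenated decoder/IR features steer where the convolution samples the IR map. A minimal sketch with assumed channel sizes:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlignmentBlock(nn.Module):
    """FAB-style module: predicts per-pixel sampling offsets from the
    concatenated decoder/IR features, then warps the IR features with a
    deformable convolution to align them with the decoder output.
    Channel sizes are illustrative, not the paper's."""
    def __init__(self, channels: int = 64, k: int = 3):
        super().__init__()
        # Two offsets (dy, dx) per kernel position.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.align = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, dec_feat: torch.Tensor, ir_feat: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(torch.cat([dec_feat, ir_feat], dim=1))
        return self.align(ir_feat, offsets)

fab = FeatureAlignmentBlock()
aligned = fab(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(aligned.shape)  # torch.Size([2, 64, 32, 32])
```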
arXiv Detail & Related papers (2025-01-25T06:21:06Z)
- PID: Physics-Informed Diffusion Model for Infrared Image Generation [11.416759828137701]
Infrared imaging technology has gained significant attention for its reliable sensing ability in low-visibility conditions. Most existing image translation methods treat infrared images as a stylistic variation, neglecting the underlying physical laws. We propose a Physics-Informed Diffusion (PID) model for translating RGB images to infrared images that adhere to physical laws.
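The "physical laws" referenced here suggest an auxiliary physics-consistency term added to the usual denoising objective. The grey-body emission penalty below is a generic illustration of that idea under assumed inputs, not PID's actual formulation:

```python
import torch
import torch.nn.functional as F

SIGMA = 5.670e-8  # Stefan-Boltzmann constant (W / m^2 / K^4)

def physics_consistency_loss(pred_ir, emissivity, temperature_k):
    """Penalize predicted IR radiance that deviates from grey-body emission
    M = eps * sigma * T^4, normalized per image. Illustrative constraint only."""
    target = emissivity * SIGMA * temperature_k ** 4
    target = target / target.amax(dim=(-2, -1), keepdim=True)
    return F.mse_loss(pred_ir, target)

def pid_style_loss(eps_pred, eps_true, pred_ir, emissivity, temperature_k, lam=0.1):
    # Standard denoising loss plus a weighted physics term.
    return F.mse_loss(eps_pred, eps_true) + lam * physics_consistency_loss(
        pred_ir, emissivity, temperature_k)

loss = pid_style_loss(torch.randn(2, 1, 8, 8), torch.randn(2, 1, 8, 8),
                      torch.rand(2, 1, 8, 8),
                      emissivity=torch.full((2, 1, 8, 8), 0.9),
                      temperature_k=torch.full((2, 1, 8, 8), 300.0))
```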
arXiv Detail & Related papers (2024-07-12T14:32:30Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of reference image, modification text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Visible to Thermal image Translation for improving visual task in low light conditions [0.0]
We have collected images from two different locations using the Parrot Anafi Thermal drone.
We created a two-stream network, preprocessed and augmented the image data, and trained the generator and discriminator models from scratch.
The findings demonstrate that it is feasible to translate RGB training data to thermal data using a GAN.
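The recipe described above follows the standard conditional-GAN pattern. A minimal pix2pix-style training step might look as follows, with toy conv stacks standing in for the paper's (unspecified) two-stream generator and discriminator:

```python
import torch
import torch.nn as nn

# Toy generator (RGB -> thermal) and conditional discriminator (RGB + thermal).
G = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 1, 3, padding=1))
D = nn.Sequential(nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
                  nn.Conv2d(32, 1, 3, padding=1))
bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
opt_g, opt_d = torch.optim.Adam(G.parameters(), 2e-4), torch.optim.Adam(D.parameters(), 2e-4)

rgb, thermal = torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64)

# Discriminator step: real pairs vs. generated pairs, conditioned on the RGB input.
fake = G(rgb)
d_real = D(torch.cat([rgb, thermal], 1))
d_fake = D(torch.cat([rgb, fake.detach()], 1))
loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool D while staying close to the ground-truth thermal image.
d_fake = D(torch.cat([rgb, fake], 1))
loss_g = bce(d_fake, torch.ones_like(d_fake)) + 100.0 * l1(fake, thermal)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```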
arXiv Detail & Related papers (2023-10-31T05:18:53Z)
- HalluciDet: Hallucinating RGB Modality for Person Detection Through Privileged Information [12.376615603048279]
HalluciDet is an IR-RGB image translation model for object detection.
We empirically compare our approach against state-of-the-art methods for image translation and for fine-tuning on IR.
arXiv Detail & Related papers (2023-10-07T03:00:33Z)
- Breaking Modality Disparity: Harmonized Representation for Infrared and Visible Image Registration [66.33746403815283]
We propose a scene-adaptive infrared and visible image registration method.
We employ homography to simulate the deformation between different planes.
We present the first misaligned infrared and visible image dataset with available ground truth.
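Simulating plane-induced deformation with a homography amounts to warping one image with a randomly perturbed 3x3 projective transform. A minimal PyTorch sketch, with arbitrary perturbation magnitudes:

```python
import torch
import torch.nn.functional as F

def random_homography(strength: float = 0.1) -> torch.Tensor:
    """Identity homography plus small random perturbations, simulating the
    deformation between infrared and visible views of a plane."""
    H = torch.eye(3)
    H[:2, :] += strength * torch.randn(2, 3)
    H[2, :2] += 0.1 * strength * torch.randn(2)
    return H

def warp_with_homography(img: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Warp a (B, C, h, w) image with a homography in normalized [-1, 1] coords."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (h, w, 3)
    warped = grid @ H.T                                         # apply H to each coord
    grid = warped[..., :2] / warped[..., 2:].clamp(min=1e-6)    # dehomogenize
    return F.grid_sample(img, grid.expand(b, h, w, 2), align_corners=True)

ir = torch.randn(1, 1, 64, 64)
misaligned = warp_with_homography(ir, random_homography())
```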
arXiv Detail & Related papers (2023-04-12T06:49:56Z)
- Edge-guided Multi-domain RGB-to-TIR image Translation for Training Vision Tasks with Challenging Labels [12.701191873813583]
The insufficient number of annotated thermal infrared (TIR) image datasets hinders TIR image-based deep learning networks from achieving performance comparable to their RGB counterparts.
We propose a modified multi-domain RGB-to-TIR image translation model focused on edge preservation, so that annotated RGB images with challenging labels can be employed.
We enabled the supervised learning of deep TIR image-based optical flow estimation and object detection, improving end-point error by 56.5% on average and achieving a best object detection mAP of 23.9%.
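An edge-preservation objective of the kind described can be illustrated by penalizing differences between Sobel edge maps of the RGB input and the translated TIR output; the exact loss used in the paper is not reproduced here:

```python
import torch
import torch.nn.functional as F

def sobel_edges(x: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradient magnitude of a (B, C, H, W) image."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    k = torch.stack([kx, ky])[:, None].repeat(x.size(1), 1, 1, 1)  # (2C, 1, 3, 3)
    g = F.conv2d(x, k.to(x), padding=1, groups=x.size(1))
    gx, gy = g[:, 0::2], g[:, 1::2]
    return (gx ** 2 + gy ** 2).sqrt()

def edge_preservation_loss(rgb: torch.Tensor, tir_pred: torch.Tensor) -> torch.Tensor:
    # Compare edge maps of the grayscale RGB input and the translated TIR image,
    # so object boundaries survive translation even as intensities change.
    gray = rgb.mean(dim=1, keepdim=True)
    return F.l1_loss(sobel_edges(gray), sobel_edges(tir_pred))

loss = edge_preservation_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))
```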
arXiv Detail & Related papers (2023-01-30T06:44:38Z)
- Does Thermal Really Always Matter for RGB-T Salient Object Detection? [153.17156598262656]
This paper proposes a network named TNet to solve the RGB-T salient object detection (SOD) task.
In this paper, we introduce a global illumination estimation module to predict the global illuminance score of the image.
We also introduce a two-stage localization and complementation module in the decoding phase to transfer object-localization and internal-integrity cues from the thermal features to the RGB modality.
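The illumination-aware design can be pictured as a scalar gate: a global illuminance score predicted from the RGB image decides how much the thermal features contribute (more thermal when the scene is dark). A sketch with illustrative layer sizes, not TNet's architecture:

```python
import torch
import torch.nn as nn

class IlluminationGatedFusion(nn.Module):
    """Gate RGB vs. thermal features by a predicted global illuminance score."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.illum_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, rgb_img, rgb_feat, thermal_feat):
        w = self.illum_head(rgb_img).view(-1, 1, 1, 1)  # 1 = bright, 0 = dark
        return self.fuse(w * rgb_feat + (1 - w) * thermal_feat)

fusion = IlluminationGatedFusion()
out = fusion(torch.rand(2, 3, 64, 64), torch.randn(2, 64, 16, 16),
             torch.randn(2, 64, 16, 16))
```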
arXiv Detail & Related papers (2022-10-09T13:50:12Z)
- Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection [16.64781797503128]
RGB-thermal salient object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair.
In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD.
Experiments on benchmark datasets and the VT723 dataset show that the proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-07-07T20:26:09Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
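The separation-and-aggregation idea can be sketched as each modality emitting a channel-attention vector that recalibrates the other modality before the two recalibrated features are aggregated. The module below is a simplified stand-in for SA-Gate, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CrossModalRecalibration(nn.Module):
    """Each modality gates the other, then the results are aggregated."""
    def __init__(self, channels: int = 64):
        super().__init__()
        def gate():
            return nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.rgb_gate, self.depth_gate = gate(), gate()
        self.aggregate = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, depth_feat):
        rgb_recal = rgb_feat * self.depth_gate(depth_feat)    # depth guides RGB
        depth_recal = depth_feat * self.rgb_gate(rgb_feat)    # RGB distills depth
        return self.aggregate(torch.cat([rgb_recal, depth_recal], dim=1))

m = CrossModalRecalibration()
fused = m(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```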
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.