ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation
- URL: http://arxiv.org/abs/2509.24878v1
- Date: Mon, 29 Sep 2025 14:55:51 GMT
- Title: ThermalGen: Style-Disentangled Flow-Based Generative Models for RGB-to-Thermal Image Translation
- Authors: Jiuhong Xiao, Roshan Nayak, Ning Zhang, Daniel Tortei, Giuseppe Loianno
- Abstract summary: Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, but synchronized and calibrated pairs are scarce. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution. We propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Paired RGB-thermal data is crucial for visual-thermal sensor fusion and cross-modality tasks, including important applications such as multi-modal image alignment and retrieval. However, the scarcity of synchronized and calibrated RGB-thermal image pairs presents a major obstacle to progress in these areas. To overcome this challenge, RGB-to-Thermal (RGB-T) image translation has emerged as a promising solution, enabling the synthesis of thermal images from abundant RGB datasets for training purposes. In this study, we propose ThermalGen, an adaptive flow-based generative model for RGB-T image translation, incorporating an RGB image conditioning architecture and a style-disentangled mechanism. To support large-scale training, we curated eight public satellite-aerial, aerial, and ground RGB-T paired datasets, and introduced three new large-scale satellite-aerial RGB-T datasets--DJI-day, Bosonplus-day, and Bosonplus-night--captured across diverse times, sensor types, and geographic regions. Extensive evaluations across multiple RGB-T benchmarks demonstrate that ThermalGen achieves comparable or superior translation performance compared to existing GAN-based and diffusion-based methods. To our knowledge, ThermalGen is the first RGB-T image translation model capable of synthesizing thermal images that reflect significant variations in viewpoints, sensor characteristics, and environmental conditions. Project page: http://xjh19971.github.io/ThermalGen
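The abstract describes a flow-based generative model conditioned on RGB images, but does not include code. As a rough illustration of the kind of conditional flow-matching objective such models are commonly trained with (this is a generic sketch, not ThermalGen's actual architecture, loss, or style-disentanglement mechanism; all names below are hypothetical), one training step can be written as:

```python
import numpy as np

def flow_matching_step(rgb, thermal, predict_velocity, rng):
    """One conditional flow-matching training step (rectified-flow style).

    rgb, thermal : a paired sample; thermal values assumed normalized.
    predict_velocity : hypothetical model v_theta(x_t, t, rgb) -> array like thermal.
    Returns the scalar velocity-regression loss for this sample.
    """
    noise = rng.standard_normal(thermal.shape)       # x_0 ~ N(0, I)
    t = rng.uniform()                                # random time in (0, 1)
    x_t = (1.0 - t) * noise + t * thermal            # linear interpolant between noise and data
    target_v = thermal - noise                       # constant velocity of the straight path
    pred_v = predict_velocity(x_t, t, rgb)           # RGB enters only as conditioning
    return float(np.mean((pred_v - target_v) ** 2))  # MSE flow-matching loss

# Toy usage with a placeholder "model" that predicts zero velocity everywhere.
rng = np.random.default_rng(0)
rgb = rng.uniform(size=(8, 8, 3))
thermal = rng.uniform(size=(8, 8))
loss = flow_matching_step(rgb, thermal, lambda x, t, c: np.zeros_like(x), rng)
```

At sampling time, such models integrate the learned velocity field from noise to an image while keeping the RGB conditioning fixed; how ThermalGen additionally disentangles sensor/style factors is specific to the paper and not reflected in this sketch.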
Related papers
- TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation [12.591408054941027]
TherA is a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both the scene and object levels. TherA achieves state-of-the-art translation performance, with up to a 33% improvement in zero-shot translation averaged across all metrics.
arXiv Detail & Related papers (2026-02-23T01:56:29Z)
- RAW-Flow: Advancing RGB-to-RAW Image Reconstruction with Deterministic Latent Flow Matching [55.03149221192589]
We introduce a novel framework named RAW-Flow to bridge the gap between RGB and RAW representations. We also introduce a cross-scale context guidance module that injects hierarchical RGB features into the flow estimation process. RAW-Flow outperforms state-of-the-art approaches both quantitatively and visually.
arXiv Detail & Related papers (2026-01-28T08:27:38Z)
- KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection [35.52055285209549]
We propose a novel prompt learning-based RGB-T SOD method, named KAN-SAM, which reveals the potential of visual foundation models for RGB-T SOD tasks. Specifically, we extend Segment Anything Model 2 (SAM2) for RGB-T SOD by introducing thermal features as guiding prompts through efficient and accurate Kolmogorov-Arnold Network (KAN) adapters. We also introduce a mutually exclusive random masking strategy to reduce reliance on RGB data and improve generalization.
arXiv Detail & Related papers (2025-04-08T10:07:02Z)
- Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human Activity Recognition has primarily relied on traditional RGB cameras to achieve high-performance activity recognition. Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras. In this work, we rethink human activity recognition by combining RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z)
- LapGSR: Laplacian Reconstructive Network for Guided Thermal Super-Resolution [1.747623282473278]
Fusing multiple modalities to produce high-resolution images often requires dense models with millions of parameters and a heavy computational load.
We propose LapGSR, a multimodal, lightweight, generative model incorporating Laplacian image pyramids for guided thermal super-resolution.
arXiv Detail & Related papers (2024-11-12T12:23:19Z)
- RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications [55.24463002889]
We focus on depth data synthesis and develop a range-aware RGB-D data simulation pipeline (RaSim).
In particular, high-fidelity depth data is generated by imitating the imaging principle of real-world sensors.
RaSim can be directly applied to real-world scenarios without any finetuning and excels at downstream RGB-D perception tasks.
arXiv Detail & Related papers (2024-04-05T08:52:32Z)
- Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation [19.41334573257174]
Traditional methods mostly use RGB images, which are heavily affected by lighting conditions, e.g., darkness.
Recent studies show thermal images are robust to the night scenario as a compensating modality for segmentation.
This work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation.
arXiv Detail & Related papers (2023-06-17T14:28:08Z)
- Hyperspectral Image Super Resolution with Real Unaligned RGB Guidance [11.711656319221072]
We propose an HSI fusion network with heterogeneous feature extraction, multi-stage feature alignment, and attentive feature fusion.
Our method obtains a clear improvement over existing single-image and fusion-based super-resolution methods on quantitative assessment as well as visual comparison.
arXiv Detail & Related papers (2023-02-13T11:56:45Z)
- Does Thermal Really Always Matter for RGB-T Salient Object Detection? [153.17156598262656]
This paper proposes a network named TNet to solve the RGB-T salient object detection (SOD) task.
In this paper, we introduce a global illumination estimation module to predict the global illuminance score of the image.
On the other hand, we introduce a two-stage localization and complementation module in the decoding phase to transfer object localization cue and internal integrity cue in thermal features to the RGB modality.
arXiv Detail & Related papers (2022-10-09T13:50:12Z)
- Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection [16.64781797503128]
RGB-thermal salient object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair.
In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD.
Experiments on benchmark and VT723 datasets show that the proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-07-07T20:26:09Z)
- Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection [73.31632581915201]
We propose a novel data-level recombination strategy to fuse RGB with D (depth) before deep feature extraction.
A newly designed lightweight triple-stream network is then applied to this reformulated data to achieve optimal channel-wise complementary fusion between RGB and D.
arXiv Detail & Related papers (2020-08-07T10:13:05Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient cross-modality guided encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
- Synergistic saliency and depth prediction for RGB-D saliency detection [76.27406945671379]
Existing RGB-D saliency datasets are small, which may lead to overfitting and limited generalization for diverse scenarios.
We propose a semi-supervised system for RGB-D saliency detection that can be trained on smaller RGB-D saliency datasets without saliency ground truth.
arXiv Detail & Related papers (2020-07-03T14:24:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.