IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
- URL: http://arxiv.org/abs/2507.14449v1
- Date: Sat, 19 Jul 2025 02:53:01 GMT
- Title: IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
- Authors: Zhe Cao, Jin Zhang, Ruiheng Zhang
- Abstract summary: We propose IRGPT, the first multi-modal large language model for real-world infrared images. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts. IRGPT achieves state-of-the-art performance even compared with larger-scale models.
- Score: 6.171775609352536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, they rely on synthetic infrared images generated through style transfer from visible images, which limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from the visible to the infrared domain by considering the difficulty scores of both infrared-visible and infrared-text pairs. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.
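To make the curriculum idea concrete, here is a minimal sketch of difficulty-based sample scheduling. It assumes per-sample difficulty scores for the infrared-visible and infrared-text alignments are already computed; the function name, the linear combination, and the weight alpha are illustrative assumptions, since the abstract does not specify how the two scores are combined.

import numpy as np

def curriculum_order(ir_vis_difficulty, ir_text_difficulty, alpha=0.5):
    # Combine the two per-sample difficulty scores; the linear mix and
    # the weight alpha are assumptions, not the paper's exact rule.
    combined = (alpha * np.asarray(ir_vis_difficulty)
                + (1 - alpha) * np.asarray(ir_text_difficulty))
    # Curriculum: present easy samples first, hardest last.
    return np.argsort(combined)

order = curriculum_order([0.2, 0.9, 0.5, 0.1, 0.7],
                         [0.3, 0.8, 0.4, 0.2, 0.6])
print(order)  # [3 0 2 4 1] -> train on sample 3 first, sample 1 last

Training then proceeds from the lowest-scoring (easiest) samples to the hardest, which is the standard curriculum-learning schedule.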
Related papers
- MTSIC: Multi-stage Transformer-based GAN for Spectral Infrared Image Colorization [26.33768545616346]
Existing colorization methods rely on single-band images with limited spectral information and insufficient feature extraction capabilities. In this paper, we propose a generative adversarial network (GAN)-based framework designed to integrate spectral information to enhance the colorization of infrared images. Experimental results demonstrate that the proposed method significantly outperforms traditional techniques and effectively enhances the visual quality of infrared images.
arXiv Detail & Related papers (2025-06-21T01:42:25Z)
- TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion [55.34830989105704]
Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities. We introduce textual semantics at two levels: the mask semantic level and the text semantic level. We propose Textual Semantic Guidance for infrared and visible image fusion, which guides the image synthesis process.
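As a rough illustration of text-semantic guidance (not TeSG's actual architecture; the cross-attention design, layer sizes, and the additive fusion are all assumptions), fused image features can attend to text-token embeddings:

import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    # Minimal sketch: fused IR/visible features query text tokens via
    # cross-attention, and the result is added back as residual guidance.
    def __init__(self, img_dim=256, txt_dim=512, heads=4):
        super().__init__()
        self.proj_txt = nn.Linear(txt_dim, img_dim)
        self.attn = nn.MultiheadAttention(img_dim, heads, batch_first=True)

    def forward(self, ir_feat, vis_feat, txt_tokens):
        # ir_feat, vis_feat: (B, N, img_dim) flattened spatial features
        # txt_tokens: (B, T, txt_dim) text embeddings (e.g., from a text encoder)
        fused = ir_feat + vis_feat              # naive additive fusion
        txt = self.proj_txt(txt_tokens)         # align text dim to image dim
        guided, _ = self.attn(fused, txt, txt)  # queries: image; keys/values: text
        return fused + guided                   # residual text guidance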
arXiv Detail & Related papers (2025-06-20T03:53:07Z)
- DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding [43.85632218045282]
We introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to progress the visible-to-infrared translation from full-range to target wavelengths. VLUM incorporates unified vision-language understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images covering various scenes and objects under diverse environmental conditions.
arXiv Detail & Related papers (2025-03-24T17:58:09Z)
- Multi-Domain Biometric Recognition using Body Embeddings [51.36007967653781]
We show that body embeddings perform better than face embeddings in the medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset. We also show that fine-tuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores.
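In PyTorch terms, that loss combination might look like the following minimal sketch; the weight, margin, and triplet-mining setup are assumptions, not the paper's exact recipe:

import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)  # margin is an assumption

def combined_loss(logits, labels, anchor, positive, negative, w=0.5):
    # logits: (B, num_ids) classifier output over identities
    # anchor/positive/negative: (B, D) embeddings mined from the batch
    # w balances the two terms; its value here is an assumption
    return ce_loss(logits, labels) + w * triplet_loss(anchor, positive, negative)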
arXiv Detail & Related papers (2025-03-13T22:38:18Z)
- Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes [3.399048100638418]
We introduce a novel approach leveraging semantic text to guide infrared small target detection, called Text-IRSTD. We propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations, called FZDT.
arXiv Detail & Related papers (2025-03-10T12:33:07Z)
- Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection [67.02804741856512]
We propose a novel Hierarchical Multi-Modal Enhancement Network (HMMEN) that integrates RGB and IR data for robust and accurate transmission line (TL) detection. Our method introduces two key components: (1) a Mutual Multi-Modal Enhanced Block (MMEB), which fuses and enhances hierarchical RGB and IR feature maps in a coarse-to-fine manner, and (2) a Feature Alignment Block (FAB) that corrects misalignments between decoder outputs and IR feature maps by leveraging deformable convolutions.
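Deformable-convolution alignment of this kind can be sketched in a few lines of PyTorch; the channel sizes and the offset predictor below are assumptions, not the paper's actual FAB:

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlign(nn.Module):
    # Minimal sketch: offsets predicted from the concatenated decoder and
    # IR features warp the IR feature map toward the decoder output.
    def __init__(self, channels=64, k=3):
        super().__init__()
        # 2 offset values (dy, dx) per kernel sampling position
        self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, dec_feat, ir_feat):
        offsets = self.offset_pred(torch.cat([dec_feat, ir_feat], dim=1))
        return self.deform(ir_feat, offsets)  # IR features resampled into alignment

The predicted offsets let the kernel sample the IR feature map at shifted locations, pulling it into registration with the decoder output.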
arXiv Detail & Related papers (2025-01-25T06:21:06Z)
- Contourlet Refinement Gate Framework for Thermal Spectrum Distribution Regularized Infrared Image Super-Resolution [54.293362972473595]
Image super-resolution (SR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts.
Current approaches to SR tasks are either dedicated to extracting RGB image features or assume similar degradation patterns.
We propose a Contourlet refinement gate framework to restore infrared modal-specific features while preserving spectral distribution fidelity.
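A gated detail-refinement step of this general flavor can be sketched as follows; note this uses a simple Laplacian-style low/high frequency split as a stand-in for the contourlet decomposition, and the gate design and channel sizes are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementGate(nn.Module):
    # Minimal sketch: a learned gate decides how much high-frequency
    # detail to inject back into the low-frequency reconstruction.
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, feat):
        # assumes even spatial dimensions
        low = F.avg_pool2d(feat, 2)                        # low-frequency band
        low_up = F.interpolate(low, scale_factor=2, mode='bilinear',
                               align_corners=False)
        high = feat - low_up                               # high-frequency residual
        return low_up + self.gate(feat) * high             # gated detail injection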
arXiv Detail & Related papers (2024-11-19T14:24:03Z)
- GAN-HA: A generative adversarial network with a novel heterogeneous dual-discriminator network and a new attention-based fusion strategy for infrared and visible image fusion [0.1160897408844138]
Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images.
Existing dual-discriminator generative adversarial networks (GANs) often rely on two structurally identical discriminators for learning.
This paper proposes a novel GAN with a heterogeneous dual-discriminator network and an attention-based fusion strategy.
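A heterogeneous dual-discriminator setup can be sketched as below; the specific architectures (a shallow, globally pooled discriminator for thermal intensity and a deeper PatchGAN-style one for texture) and the single-channel inputs are illustrative assumptions, not GAN-HA's actual networks:

import torch.nn as nn

def conv_block(cin, cout, stride=2):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride, 1),
                         nn.LeakyReLU(0.2, inplace=True))

D_ir = nn.Sequential(            # shallow: judges global thermal-intensity realism
    conv_block(1, 32), conv_block(32, 64),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

D_vis = nn.Sequential(           # deeper PatchGAN-style: judges local texture
    conv_block(1, 32), conv_block(32, 64), conv_block(64, 128),
    nn.Conv2d(128, 1, 4, 1, 1))  # per-patch real/fake map

The point of the heterogeneity is that each discriminator imposes a different notion of realism on the fused image, rather than two copies of the same network learning redundant criteria.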
arXiv Detail & Related papers (2024-04-24T17:06:52Z)
- Unsupervised Misaligned Infrared and Visible Image Fusion via Cross-Modality Image Generation and Registration [59.02821429555375]
We present a robust cross-modality generation-registration paradigm for unsupervised misaligned infrared and visible image fusion.
To better fuse the registered infrared and visible images, we present a Feature Interaction Fusion Module (IFM).
arXiv Detail & Related papers (2022-05-24T07:51:57Z)
- Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection [65.30079184700755]
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commonalities underlying the two modalities and fuse in the common space, either by iterative optimization or with deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, then unrolls it into a target-aware Dual Adversarial Learning (TarDAL) network for fusion coupled with a commonly used detection network.
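In generic form (the notation below is assumed, not the paper's exact objective), such a bilevel problem reads:

\min_{\psi} \; \mathcal{L}_{\mathrm{det}}\big(D_{\psi}(F_{\theta^{*}}(x_{\mathrm{ir}}, x_{\mathrm{vis}}))\big)
\quad \text{s.t.} \quad
\theta^{*} \in \arg\min_{\theta} \; \mathcal{L}_{\mathrm{fus}}\big(F_{\theta}(x_{\mathrm{ir}}, x_{\mathrm{vis}})\big)

where F_\theta is the fusion network, D_\psi the detector, and the lower-level fusion objective constrains the upper-level detection objective.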
arXiv Detail & Related papers (2022-03-30T11:44:56Z)