Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection
- URL: http://arxiv.org/abs/2601.04381v1
- Date: Wed, 07 Jan 2026 20:41:26 GMT
- Title: Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection
- Authors: Maxim Clouser, Kia Khezeli, John Kalantari
- Abstract summary: Foundation models for vision are predominantly trained on RGB data. Many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator.
- Score: 0.726437825413781
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
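The paper's recipe has two mechanical pieces: LoRA modules inserted into a frozen foundation model, and adapter selection by LPIPS on a small held-out set. The sketch below illustrates both in NumPy under stated assumptions: the layer sizes, rank/alpha values, and LPIPS scores are illustrative placeholders, not the paper's actual configurations or results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; rank r and scale alpha are the LoRA hyperparameters
# swept in the paper's grid (values here are illustrative only).
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))     # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # zero-init up-projection: adapter starts inert

def lora_forward(x):
    # Effective weight is W + (alpha / r) * B @ A; only A and B receive gradients.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.standard_normal((4, d_in))

# Adapter selection as described in the abstract: pick the configuration with the
# lowest mean LPIPS on held-out pairs (scores below are placeholders, not results).
lpips_by_config = {"r8_a16": 0.31, "r16_a32": 0.27, "r32_a32": 0.29}
best = min(lpips_by_config, key=lpips_by_config.get)
```

Because B is zero-initialized, the adapted model reproduces the base model exactly before training, which is the standard LoRA starting point.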
Related papers
- WiSE-OD: Benchmarking Robustness in Infrared Object Detection [12.115815831689265]
WiSE-OD is a weight-space ensembling method with two variants: WiSE-OD$_{ZS}$, which combines RGB zero-shot and IR fine-tuned weights, and WiSE-OD$_{LP}$, which blends zero-shot and linear-probing weights. We introduce LLVIP-C and FLIR-C, two cross-modality out-of-distribution benchmarks built by applying corruptions to standard IR datasets.
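Weight-space ensembling of the kind WiSE-OD describes can be sketched as a linear interpolation between two sets of matching parameters. The snippet below is a minimal illustration, not the paper's implementation; real detectors would need care around batch-norm statistics and layer-wise choices.

```python
import numpy as np

def wise_interpolate(zero_shot, fine_tuned, lam=0.5):
    # Linearly interpolate matching parameters of the RGB zero-shot model and
    # the IR fine-tuned model; lam=0 recovers zero-shot, lam=1 fine-tuned.
    assert zero_shot.keys() == fine_tuned.keys()
    return {k: (1 - lam) * zero_shot[k] + lam * fine_tuned[k] for k in zero_shot}

# Toy parameter dictionaries standing in for full model state dicts.
zs = {"conv.weight": np.ones((2, 2)), "head.bias": np.zeros(3)}
ft = {"conv.weight": 3 * np.ones((2, 2)), "head.bias": np.ones(3)}
mixed = wise_interpolate(zs, ft, lam=0.5)
```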
arXiv Detail & Related papers (2025-07-25T03:33:50Z) - End-to-End RGB-IR Joint Image Compression With Channel-wise Cross-modality Entropy Model [39.52468600966148]
As the number of modalities increases, the required data storage and transmission costs also double. This work proposes a joint compression framework for RGB-IR image pairs.
arXiv Detail & Related papers (2025-06-27T02:04:21Z) - Multi-Domain Biometric Recognition using Body Embeddings [51.36007967653781]
We show that body embeddings perform better than face embeddings in the medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset. We also show that fine-tuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores.
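The combined cross-entropy and triplet objective mentioned above can be written compactly. This is a generic sketch of the two standard losses, assuming Euclidean embedding distances and a hypothetical margin of 0.3; the paper's exact weighting and mining strategy are not reproduced here.

```python
import numpy as np

def cross_entropy(logits, label):
    z = logits - logits.max()            # numerically stabilized log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Pull anchor toward the positive and push it from the negative by a margin.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def combined_loss(logits, label, a, p, n, w_ce=1.0, w_tri=1.0):
    return w_ce * cross_entropy(logits, label) + w_tri * triplet_loss(a, p, n)
```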
arXiv Detail & Related papers (2025-03-13T22:38:18Z) - Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection [67.02804741856512]
We propose a novel Hierarchical Multi-Modal Enhancement Network (HMMEN) that integrates RGB and IR data for robust and accurate TL detection.<n>Our method introduces two key components: (1) a Mutual Multi-Modal Enhanced Block (MMEB), which fuses and enhances hierarchical RGB and IR feature maps in a coarse-to-fine manner, and (2) a Feature Alignment Block (FAB) that corrects misalignments between decoder outputs and IR feature maps by leveraging deformable convolutions.
arXiv Detail & Related papers (2025-01-25T06:21:06Z) - VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition [54.27379947727035]
This paper proposes a novel PEFT strategy to adapt pre-trained foundation vision models for RGB-Event-based classification. The frame difference of the dual modalities is also considered to capture motion cues via a frame-difference backbone network. The source code and pre-trained models will be released at https://github.com/Event-AHU/VELoRA.
arXiv Detail & Related papers (2024-12-28T07:38:23Z) - The Solution for the GAIIC2024 RGB-TIR object detection Challenge [5.625794757504552]
RGB-TIR object detection aims to utilize both RGB and TIR images for complementary information during detection.
Our proposed method achieved mAP scores of 0.516 and 0.543 on the A and B benchmarks, respectively.
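Several entries in this list report mAP, which averages per-class average precision (AP). As a reference point, one common all-point-interpolated AP computation over a precision-recall curve looks like the following sketch (variants such as COCO's 101-point interpolation differ in detail):

```python
import numpy as np

def average_precision(recalls, precisions):
    # All-point interpolated AP: area under the precision-recall step function.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas where recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Toy curve: precision 1.0 at recall 0.5, precision 0.5 at recall 1.0.
ap = average_precision(np.array([0.5, 1.0]), np.array([1.0, 0.5]))
```

mAP is then the mean of these AP values over all object classes (and, in COCO-style evaluation, over IoU thresholds as well).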
arXiv Detail & Related papers (2024-07-04T12:08:36Z) - UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning [34.727262809777095]
We propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-04-26T12:21:57Z) - Tensor Factorization for Leveraging Cross-Modal Knowledge in
Data-Constrained Infrared Object Detection [22.60228799622782]
A key bottleneck for object detection in IR images is the lack of sufficient labeled training data.
We seek to leverage cues from the RGB modality to scale object detectors to the IR modality while preserving model performance in the RGB modality.
We first pretrain these factor matrices on the RGB modality, for which plentiful training data are assumed to exist, and then add only a few trainable parameters for training on the IR modality to avoid over-fitting.
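The factor-matrix idea above resembles parameter-efficient transfer in general: factorize a layer, pretrain the factors on the data-rich modality, then retrain only a small factor on the data-poor one. The sketch below illustrates the parameter-count argument with hypothetical shapes; it is not the paper's exact factorization scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, k = 32, 32, 4   # hypothetical layer and factor sizes

# Factorize a detector weight as W ~ U @ V. U is learned on abundant RGB data
# and then frozen; only the small factor V is re-trained for IR.
U = rng.standard_normal((d_out, k))
V_rgb = rng.standard_normal((k, d_in))
V_ir = V_rgb.copy()           # initialized from RGB, fine-tuned on scarce IR labels

def detector_weight(V):
    return U @ V

n_trainable_ir = V_ir.size    # k * d_in parameters to train on IR
n_full = d_out * d_in         # a full unfactorized layer, for comparison
```

Training only `V_ir` keeps the IR-specific parameter count a small fraction of the full layer, which is the over-fitting safeguard the abstract describes.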
arXiv Detail & Related papers (2023-09-28T16:55:52Z) - DiffIR: Efficient Diffusion Model for Image Restoration [108.82579440308267]
The diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process as a sequential application of a denoising network.
Running traditional DMs for massive iterations on a large model to estimate whole images or feature maps is inefficient for image restoration.
We propose DiffIR, which consists of a compact IR prior extraction network (CPEN), dynamic IR transformer (DIRformer), and denoising network.
arXiv Detail & Related papers (2023-03-16T16:47:14Z) - Assessing thermal imagery integration into object detection methods on ground-based and air-based collection platforms [0.0]
This work fuses RGB with thermal long-wave infrared (LWIR) images to increase the performance of object detection machine learning (ML) models.
The ground-based blended RGB-LWIR model exhibited superior performance compared to the RGB-only or LWIR-only approaches, achieving a mAP of 98.4%.
This research additionally contributes a novel labelled training dataset of 12,600 images for RGB, LWIR, and RGB-LWIR fused imagery, collected from ground-based and air-based platforms.
arXiv Detail & Related papers (2022-12-23T23:51:53Z) - Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use self-supervised representation learning to design two pretext tasks: cross-modal auto-encoding and depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - DUT-LFSaliency: Versatile Dataset and Light Field-to-RGB Saliency Detection [104.50425501764806]
We introduce a large-scale dataset to enable versatile applications for light field saliency detection.
We present an asymmetrical two-stream model consisting of the Focal stream and RGB stream.
Experiments demonstrate that our Focal stream achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-30T11:53:27Z) - Synergistic saliency and depth prediction for RGB-D saliency detection [76.27406945671379]
Existing RGB-D saliency datasets are small, which may lead to overfitting and limited generalization for diverse scenarios.
We propose a semi-supervised system for RGB-D saliency detection that can be trained on smaller RGB-D saliency datasets without saliency ground truth.
arXiv Detail & Related papers (2020-07-03T14:24:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.