Related papers: Quantitative Comparison of Fine-Tuning Techniques for Pretrained Latent Diffusion Models in the Generation of Unseen SAR Images

URL: http://arxiv.org/abs/2506.13307v2
Date: Thu, 14 Aug 2025 16:29:14 GMT
Title: Quantitative Comparison of Fine-Tuning Techniques for Pretrained Latent Diffusion Models in the Generation of Unseen SAR Images
Authors: Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Olivier Lévêque, Elise Colin
Abstract summary: We adapt an open-source text-to-image foundation model to the Synthetic Aperture Radar (SAR) modality. We compare full fine-tuning and parameter-efficient Low-Rank Adaptation (LoRA) across the UNet diffusion backbone, the Variational Autoencoder (VAE) and the text encoders. Our results show that a hybrid strategy (full UNet tuning with LoRA on the text encoders and a learned token embedding) best preserves SAR geometry and texture.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a framework for adapting a large pretrained latent diffusion model to high-resolution Synthetic Aperture Radar (SAR) image generation. The approach enables controllable synthesis and the creation of rare or out-of-distribution scenes beyond the training set. Rather than training a task-specific small model from scratch, we adapt an open-source text-to-image foundation model to the SAR modality, using its semantic prior to align prompts with SAR imaging physics (side-looking geometry, slant-range projection, and coherent speckle with heavy-tailed statistics). Using a 100k-image SAR dataset, we compare full fine-tuning and parameter-efficient Low-Rank Adaptation (LoRA) across the UNet diffusion backbone, the Variational Autoencoder (VAE), and the text encoders. Evaluation combines (i) statistical distances to real SAR amplitude distributions, (ii) textural similarity via Gray-Level Co-occurrence Matrix (GLCM) descriptors, and (iii) semantic alignment using a SAR-specialized CLIP model. Our results show that a hybrid strategy (full UNet tuning with LoRA on the text encoders and a learned token embedding) best preserves SAR geometry and texture while maintaining prompt fidelity. The framework supports text-based control and multimodal conditioning (e.g., segmentation maps, TerraSAR-X, or optical guidance), opening new paths for large-scale SAR scene data augmentation and unseen scenario simulation in Earth observation.
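
Illustration: the textural criterion (ii) in the abstract relies on GLCM descriptors. The following is a minimal pure-Python sketch of a gray-level co-occurrence matrix and two common descriptors (contrast and homogeneity). The function names, the single pixel offset (dx, dy), the symmetric normalization, and the toy 4-level images are assumptions made for illustration; the abstract does not state which GLCM settings or descriptors the authors used.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric gray-level co-occurrence matrix for one offset.

    image  : list of rows of integer gray levels in [0, levels)
    dx, dy : pixel offset between the two pixels of each pair (assumed choice)
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1.0
                counts[j][i] += 1.0  # count the pair in both directions
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("image too small for the requested offset")
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Weights co-occurrences by squared gray-level difference."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def homogeneity(p):
    """Close to 1 when neighboring pixels share similar gray levels."""
    n = len(p)
    return sum(p[i][j] / (1.0 + (i - j) ** 2) for i in range(n) for j in range(n))


# Toy 4-level images (hypothetical data, not SAR measurements).
flat = [[1, 1, 1], [1, 1, 1]]      # uniform patch
checker = [[0, 3, 0], [3, 0, 3]]   # alternating extremes

for name, img in (("flat", flat), ("checker", checker)):
    p = glcm(img, levels=4)
    print(name, contrast(p), homogeneity(p))
```

In such a comparison, descriptors computed on generated images would be set against the same descriptors computed on real SAR images; real amplitudes would first need to be quantized to a fixed number of gray levels, which is another choice this sketch leaves open.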

Related papers

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders [74.72147962028265]
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet. We investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation.
arXiv Detail & Related papers (2026-01-22T18:58:16Z)
Visual Autoregressive Modelling for Monocular Depth Estimation [69.01449528371916]
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets.
arXiv Detail & Related papers (2025-12-27T17:08:03Z)
GDROS: A Geometry-Guided Dense Registration Framework for Optical-SAR Images under Large Geometric Transformations [24.22541638346487]
We propose GDROS, a geometry-guided dense registration framework leveraging global cross-modal image interactions. First, we extract cross-modal deep features from optical and SAR images through a CNN-Transformer hybrid feature extraction module. We then implement a least squares regression (LSR) module to geometrically constrain the predicted dense optical flow field.
arXiv Detail & Related papers (2025-11-01T15:40:34Z)
Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z)
Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images [51.74614065919118]
This paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. We propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features.
arXiv Detail & Related papers (2025-08-25T14:22:57Z)
Knowledge-guided Complex Diffusion Model for PolSAR Image Classification in Contourlet Domain [58.46450049579116]
We propose a knowledge-guided complex diffusion model for PolSAR image classification in the Contourlet domain. Specifically, the Contourlet transform is first applied to decompose the data into low- and high-frequency subbands. A knowledge-guided complex diffusion network is then designed to model the statistical properties of the low-frequency components.
arXiv Detail & Related papers (2025-07-08T04:50:28Z)
Dataset Distillation with Probabilistic Latent Features [9.318549327568695]
A compact set of synthetic data can effectively replace the original dataset in downstream classification tasks. We propose a novel approach that models the joint distribution of latent features. Our method achieves state-of-the-art cross-architecture performance across a range of backbone architectures.
arXiv Detail & Related papers (2025-05-10T13:53:49Z)
PromptMID: Modal Invariant Descriptors Based on Diffusion and Vision Foundation Models for Optical-SAR Image Matching [15.840638449527399]
We propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models. Experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods.
arXiv Detail & Related papers (2025-02-25T11:19:26Z)
Conditional Brownian Bridge Diffusion Model for VHR SAR to Optical Image Translation [5.578820789388206]
This letter introduces a conditional image-to-image translation approach based on the Brownian Bridge Diffusion Model (BBDM). We conducted comprehensive experiments on the MSAW dataset, a paired SAR and optical images collection of 0.5m Very-High-Resolution (VHR)
arXiv Detail & Related papers (2024-08-15T05:43:46Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class. We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data. We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
SatDM: Synthesizing Realistic Satellite Image with Semantic Layout Conditioning using Diffusion Models [0.0]
Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated significant promise in synthesizing realistic images from semantic layouts. In this paper, a conditional DDPM model capable of taking a semantic map and generating high-quality, diverse, and correspondingly accurate satellite images is implemented. The effectiveness of our proposed model is validated using a meticulously labeled dataset introduced within the context of this study.
arXiv Detail & Related papers (2023-09-28T19:39:13Z)
ExposureDiffusion: Learning to Expose for Low-light Image Enhancement [87.08496758469835]
This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. The proposed framework can work with real-paired datasets, SOTA noise models, and different backbone networks.
arXiv Detail & Related papers (2023-07-15T04:48:35Z)
Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets. It uses a retrieval-then-optimization procedure to synthesize pseudo text features. It can be beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
A Feature Fusion-Net Using Deep Spatial Context Encoder and Nonstationary Joint Statistical Model for High Resolution SAR Image Classification [10.152675581771113]
A novel end-to-end supervised classification method is proposed for HR SAR images. To extract more effective spatial features, a new deep spatial context encoder network (DSCEN) is proposed. To enhance the diversity of statistics, the nonstationary joint statistical model (NS-JSM) is adopted to form the global statistical features.
arXiv Detail & Related papers (2021-05-11T06:20:14Z)
Sparse Signal Models for Data Augmentation in Deep Learning ATR [0.8999056386710496]
We propose a data augmentation approach to incorporate domain knowledge and improve the generalization power of a data-intensive learning algorithm. We exploit the sparsity of the scattering centers in the spatial domain and the smoothly-varying structure of the scattering coefficients in the azimuthal domain to solve the ill-posed problem of over-parametrized model fitting.
arXiv Detail & Related papers (2020-12-16T21:46:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.