Related papers: PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images

PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images

URL: http://arxiv.org/abs/2312.08192v1
Date: Wed, 13 Dec 2023 14:57:28 GMT
Title: PAD: Self-Supervised Pre-Training with Patchwise-Scale Adapter for Infrared Images
Authors: Tao Zhang, Kun Ding, Jinyong Wen, Yu Xiong, Zeyu Zhang, Shiming Xiang, Chunhong Pan
Abstract summary: Self-supervised learning (SSL) for RGB images has achieved significant success, yet there is still limited research on SSL for infrared images. Non-iconic infrared images rendering common pre-training tasks less effective. The scarcity of fine-grained textures making it particularly challenging to learn general image features.
Score: 45.507517332100804
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-supervised learning (SSL) for RGB images has achieved significant success, yet there is still limited research on SSL for infrared images, primarily due to three prominent challenges: 1) the lack of a suitable large-scale infrared pre-training dataset, 2) the distinctiveness of non-iconic infrared images rendering common pre-training tasks like masked image modeling (MIM) less effective, and 3) the scarcity of fine-grained textures making it particularly challenging to learn general image features. To address these issues, we construct a Multi-Scene Infrared Pre-training (MSIP) dataset comprising 178,756 images, and introduce object-sensitive random RoI cropping, an image preprocessing method, to tackle the challenge posed by non-iconic images. To alleviate the impact of weak textures on feature learning, we propose a pre-training paradigm called Pre-training with ADapter (PAD), which uses adapters to learn domain-specific features while freezing parameters pre-trained on ImageNet to retain the general feature extraction capability. This new paradigm is applicable to any transformer-based SSL method. Furthermore, to achieve more flexible coordination between pre-trained and newly-learned features in different layers and patches, a patchwise-scale adapter with dynamically learnable scale factors is introduced. Extensive experiments on three downstream tasks show that PAD, with only 1.23M pre-trainable parameters, outperforms other baseline paradigms including continual full pre-training on MSIP. Our code and dataset are available at https://github.com/casiatao/PAD.

Related papers

Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning [34.310926877797584]
Fine-tuning pre-trained models on remote sensing (RS) images is a straightforward solution.<n>Existing methods apply parameter-efficient fine-tuning (PEFT) techniques, such as LoRA and AdaptFormer.<n>We propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning.
arXiv Detail & Related papers (2026-03-02T11:38:18Z)
Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening [7.37033839561317]
Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets.<n>We propose a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors.<n>The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings.
arXiv Detail & Related papers (2025-12-02T10:56:26Z)
IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection [92.56025546608699]
IrisNet is a novel meta-learned framework that adapts detection strategies to the input infrared image status.<n>Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters.<n> Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet.
arXiv Detail & Related papers (2025-11-25T13:53:54Z)
PanAdapter: Two-Stage Fine-Tuning with Spatial-Spectral Priors Injecting for Pansharpening [8.916207546866048]
We propose an efficient fine-tuning method, namely PanAdapter, to alleviate the issue of small-scale datasets in pansharpening tasks. We fine-tune the pre-trained CNN model and extract task-specific priors at two scales by proposed Local Prior Extraction (LPE) module. We demonstrate that our approach can benefit from pre-trained image restoration models and achieve state-of-the-art performance in several benchmark pansharpening datasets.
arXiv Detail & Related papers (2024-09-11T03:13:08Z)
Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples. For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge. We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation [8.840744039764092]
We propose a single-shot approach to determining 6-DoF pose of an object with available 3D computer-aided design (CAD) model from a single RGB image. Our method, dubbed MRC-Net, comprises two stages. The first performs pose classification and renders the 3D object in the classified pose. The second stage performs regression to predict fine-grained residual pose within class.
arXiv Detail & Related papers (2024-03-12T18:36:59Z)
Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection [22.60228799622782]
Key bottleneck in object detection in IR images is lack of sufficient labeled training data. We seek to leverage cues from the RGB modality to scale object detectors to the IR modality, while preserving model performance in the RGB modality. We first pretrain these factor matrices on the RGB modality, for which plenty of training data are assumed to exist and then augment only a few trainable parameters for training on the IR modality to avoid over-fitting.
arXiv Detail & Related papers (2023-09-28T16:55:52Z)
Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain. GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors. We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images. It reconstructs Histograms of Oriented Gradients (HOG) feature instead of original RGB values of the input images. It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
Noise Self-Regression: A New Learning Paradigm to Enhance Low-Light Images Without Task-Related Data [86.68013790656762]
We propose Noise SElf-Regression (NoiSER) without access to any task-related data. NoiSER is highly competitive in enhancement quality, yet with a much smaller model size, and much lower training and inference cost.
arXiv Detail & Related papers (2022-11-09T06:18:18Z)
P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting [94.11915008006483]
We propose a novel Point-to-Pixel prompting for point cloud analysis. Our method attains 89.3% accuracy on the hardest setting of ScanObjectNN. Our framework also exhibits very competitive performance on ModelNet classification and ShapeNet Part Code.
arXiv Detail & Related papers (2022-08-04T17:59:03Z)
Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision. We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation. Our pretext tasks require only a few and un RGB-D datasets to perform pre-training, which make the network capture rich semantic contexts. For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
Remote Sensing Image Scene Classification with Self-Supervised Paradigm under Limited Labeled Samples [11.025191332244919]
We introduce new self-supervised learning (SSL) mechanism to obtain the high-performance pre-training model for RSIs scene classification from large unlabeled data. Experiments on three commonly used RSIs scene classification datasets demonstrated that this new learning paradigm outperforms the traditional dominant ImageNet pre-trained model. The insights distilled from our studies can help to foster the development of SSL in the remote sensing community.
arXiv Detail & Related papers (2020-10-02T09:27:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.