RingMo-lite: A Remote Sensing Multi-task Lightweight Network with
CNN-Transformer Hybrid Framework
- URL: http://arxiv.org/abs/2309.09003v1
- Date: Sat, 16 Sep 2023 14:15:59 GMT
- Title: RingMo-lite: A Remote Sensing Multi-task Lightweight Network with
CNN-Transformer Hybrid Framework
- Authors: Yuelei Wang, Ting Zhang, Liangjin Zhao, Lin Hu, Zhechao Wang, Ziqing
Niu, Peirui Cheng, Kaiqiang Chen, Xuan Zeng, Zhirui Wang, Hongqi Wang and
Xian Sun
- Abstract summary: This paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework to optimize the interpretation process.
The proposed RingMo-lite reduces the parameters over 60% in various RS image interpretation tasks, the average accuracy drops by less than 2% in most of the scenes and achieves SOTA performance compared to models of the similar size.
- Score: 15.273362355253779
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, remote sensing (RS) vision foundation models such as RingMo
have emerged and achieved excellent performance in various downstream tasks.
However, the high demand for computing resources limits the application of
these models on edge devices. It is necessary to design a more lightweight
foundation model to support on-orbit RS image interpretation. Existing methods
face challenges in achieving lightweight solutions while retaining
generalization in RS image interpretation. This is due to the complex high and
low-frequency spectral components in RS images, which make traditional single
CNN or Vision Transformer methods unsuitable for the task. Therefore, this
paper proposes RingMo-lite, an RS multi-task lightweight network with a
CNN-Transformer hybrid framework, which effectively exploits the
frequency-domain properties of RS to optimize the interpretation process. It is
combined by the Transformer module as a low-pass filter to extract global
features of RS images through a dual-branch structure, and the CNN module as a
stacked high-pass filter to extract fine-grained details effectively.
Furthermore, in the pretraining stage, the designed frequency-domain masked
image modeling (FD-MIM) combines each image patch's high-frequency and
low-frequency characteristics, effectively capturing the latent feature
representation in RS data. As shown in Fig. 1, compared with RingMo, the
proposed RingMo-lite reduces the parameters over 60% in various RS image
interpretation tasks, the average accuracy drops by less than 2% in most of the
scenes and achieves SOTA performance compared to models of the similar size. In
addition, our work will be integrated into the MindSpore computing platform in
the near future.
Related papers
- OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset [66.15872913664407]
We present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach.
We curated a high-quality dataset named OpticalRS-4M by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication.
Experiments demonstrate that OpticalRS-4M significantly improves classification, detection, and segmentation performance, while SelectiveMAE increases training efficiency over 2 times.
arXiv Detail & Related papers (2024-06-17T15:41:57Z) - DVMSR: Distillated Vision Mamba for Efficient Super-Resolution [7.551130027327461]
We propose DVMSR, a novel lightweight Image SR network that incorporates Vision Mamba and a distillation strategy.
Our proposed DVMSR can outperform state-of-the-art efficient SR methods in terms of model parameters.
arXiv Detail & Related papers (2024-05-05T17:34:38Z) - Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring [25.36888929483233]
We propose a multi-scale network based on single-input and multiple-outputs(SIMO) for motion deblurring.
We combine the characteristics of real-world trajectories with a learnable wavelet transform module to focus on the directional continuity and frequency features of the step-by-step transitions between blurred images to sharp images.
arXiv Detail & Related papers (2023-12-29T02:59:40Z) - Spatially-Adaptive Feature Modulation for Efficient Image
Super-Resolution [90.16462805389943]
We develop a spatially-adaptive feature modulation (SAFM) mechanism upon a vision transformer (ViT)-like block.
Proposed method is $3times$ smaller than state-of-the-art efficient SR methods.
arXiv Detail & Related papers (2023-02-27T14:19:31Z) - Efficient Context Integration through Factorized Pyramidal Learning for
Ultra-Lightweight Semantic Segmentation [1.0499611180329804]
We propose a novel Factorized Pyramidal Learning (FPL) module to aggregate rich contextual information in an efficient manner.
We decompose the spatial pyramid into two stages which enables a simple and efficient feature fusion within the module to solve the notorious checkerboard effect.
Based on the FPL module and FIR unit, we propose an ultra-lightweight real-time network, called FPLNet, which achieves state-of-the-art accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-02-23T05:34:51Z) - Rolling Shutter Inversion: Bring Rolling Shutter Images to High
Framerate Global Shutter Video [111.08121952640766]
This paper presents a novel deep-learning based solution to the RS temporal super-resolution problem.
By leveraging the multi-view geometry relationship of the RS imaging process, our framework successfully achieves high framerate GS generation.
Our method can produce high-quality GS image sequences with rich details, outperforming the state-of-the-art methods.
arXiv Detail & Related papers (2022-10-06T16:47:12Z) - Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INN) are able to increase upscaling accuracy significantly by optimizing the downscaling and upscaling cycle jointly.
A simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling by training only one model in this work.
It is shown to achieve a state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs.
arXiv Detail & Related papers (2022-09-26T22:22:30Z) - Asymmetric CNN for image super-resolution [102.96131810686231]
Deep convolutional neural networks (CNNs) have been widely applied for low-level vision over the past five years.
We propose an asymmetric CNN (ACNet) comprising an asymmetric block (AB), a mem?ory enhancement block (MEB) and a high-frequency feature enhancement block (HFFEB) for image super-resolution.
Our ACNet can effectively address single image super-resolution (SISR), blind SISR and blind SISR of blind noise problems.
arXiv Detail & Related papers (2021-03-25T07:10:46Z) - MPRNet: Multi-Path Residual Network for Lightweight Image Super
Resolution [2.3576437999036473]
A novel lightweight super resolution network is proposed, which improves the SOTA performance in lightweight SR.
The proposed architecture also contains a new attention mechanism, Two-Fold Attention Module, to maximize the representation ability of the model.
arXiv Detail & Related papers (2020-11-09T17:11:15Z) - Real Image Super Resolution Via Heterogeneous Model Ensemble using
GP-NAS [63.48801313087118]
We propose a new method for image superresolution using deep residual network with dense skip connections.
The proposed method won the first place in all three tracks of the AIM 2020 Real Image Super-Resolution Challenge.
arXiv Detail & Related papers (2020-09-02T22:33:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.