A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning
- URL: http://arxiv.org/abs/2506.09429v1
- Date: Wed, 11 Jun 2025 06:24:02 GMT
- Title: A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning
- Authors: Swadhin Das, Divyansh Mundra, Priyanshu Dayal, Raksha Sharma
- Abstract summary: A lightweight transformer architecture is proposed to reduce the dimensionality of the encoder layers and employ a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network. Experimental results demonstrate that the proposed approach significantly improves caption quality compared to state-of-the-art methods.
- Score: 0.12499537119440242
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformer-based models have achieved strong performance in remote sensing image captioning by capturing long-range dependencies and contextual information. However, their practical deployment is hindered by high computational costs, especially in multi-modal frameworks that employ separate transformer-based encoders and decoders. In addition, existing remote sensing image captioning models primarily focus on high-level semantic extraction while often overlooking fine-grained structural features such as edges, contours, and object boundaries. To address these challenges, a lightweight transformer architecture is proposed by reducing the dimensionality of the encoder layers and employing a distilled version of GPT-2 as the decoder. A knowledge distillation strategy is used to transfer knowledge from a more complex teacher model to improve the performance of the lightweight network. Furthermore, an edge-aware enhancement strategy is incorporated to enhance image representation and object boundary understanding, enabling the model to capture fine-grained spatial details in remote sensing images. Experimental results demonstrate that the proposed approach significantly improves caption quality compared to state-of-the-art methods.
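The paper is prose-only here; as a rough illustration of the two components the abstract names, the sketch below shows (i) an edge-aware enhancement of the encoder features using fixed Sobel filters and (ii) a standard soft-label knowledge-distillation loss between a larger teacher captioner and the lightweight student. The module names, the choice of Sobel filters, the temperature, and the loss weighting are assumptions made for illustration, not the authors' released implementation; the distilled GPT-2 decoder is omitted.

```python
# Illustrative sketch only (not the authors' code): edge-aware feature enhancement
# via fixed Sobel kernels, plus a soft-label knowledge-distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareEnhancement(nn.Module):
    """Fuse a fixed Sobel edge map into the visual features (hypothetical design)."""
    def __init__(self, channels: int):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        # Two fixed (non-learned) 3x3 Sobel kernels applied to a grayscale copy of the image.
        self.register_buffer("kernels", torch.stack([sobel_x, sobel_y]).unsqueeze(1))
        self.fuse = nn.Conv2d(channels + 2, channels, kernel_size=1)

    def forward(self, features: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        gray = image.mean(dim=1, keepdim=True)                    # (B, 1, H, W)
        edges = F.conv2d(gray, self.kernels, padding=1)           # (B, 2, H, W)
        edges = F.interpolate(edges, size=features.shape[-2:])    # match the feature map
        return self.fuse(torch.cat([features, edges], dim=1))     # edge-enhanced features


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Cross-entropy on the ground-truth caption plus KL divergence towards the
    teacher's temperature-softened logits (assumed weighting and temperature)."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd
```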
Related papers
- Frequency-Domain Fusion Transformer for Image Inpainting [6.4194162137514725]
This paper proposes a Transformer-based image inpainting method incorporating frequency-domain fusion. Experimental results demonstrate that the proposed method effectively improves the quality of image inpainting by preserving more high-frequency information.
arXiv Detail & Related papers (2025-06-23T09:19:04Z)
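For context on the frequency-domain fusion mentioned in the entry above, the sketch below shows one generic way to fuse a spatial branch with an FFT-domain branch using `torch.fft`; it is a common pattern, not the paper's actual design, and all module names and shapes are assumptions.

```python
# Generic frequency-domain fusion block (illustrative only): a spatial branch and
# an FFT-domain branch are combined so the output retains more high-frequency detail.
import torch
import torch.nn as nn

class FrequencyFusionBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # 1x1 convolution applied to the stacked real/imaginary parts of the spectrum.
        self.spectral = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Frequency branch: 2-D real FFT, pointwise mixing, inverse FFT.
        spec = torch.fft.rfft2(x, norm="ortho")                   # (B, C, H, W//2+1), complex
        spec = torch.cat([spec.real, spec.imag], dim=1)           # (B, 2C, H, W//2+1)
        spec = self.spectral(spec)
        real, imag = spec.chunk(2, dim=1)
        freq = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
        # Fuse the spatial and frequency branches.
        return self.fuse(torch.cat([self.spatial(x), freq], dim=1))
```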
- WiTUnet: A U-Shaped Architecture Integrating CNN and Transformer for Improved Feature Alignment and Local Information Fusion [16.41082757280262]
Low-dose computed tomography (LDCT) has become the technology of choice for diagnostic medical imaging, given its lower radiation dose compared to standard CT.
In this paper, we introduce WiTUnet, a novel LDCT image denoising method that utilizes nested, dense skip pathways instead of traditional skip connections.
arXiv Detail & Related papers (2024-04-15T07:53:07Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- Effective Image Tampering Localization via Enhanced Transformer and Co-attention Fusion [5.691973573807887]
We propose an effective image tampering localization network (EITLNet) based on a two-branch enhanced transformer encoder.
The features extracted from RGB and noise streams are fused effectively by the coordinate attention-based fusion module.
arXiv Detail & Related papers (2023-09-17T15:43:06Z)
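Coordinate attention is a published module (Hou et al., CVPR 2021), but the exact fusion wiring in EITLNet is not spelled out in the summary above; the sketch below shows one common way to fuse two feature streams with coordinate attention, with all names and dimensions assumed.

```python
# Illustrative two-stream fusion with coordinate attention (not EITLNet's exact code).
# Coordinate attention factorizes global pooling into height- and width-wise pooling,
# so the attention maps retain positional information along each axis.
import torch
import torch.nn as nn

class CoordinateAttentionFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)  # concat RGB + noise streams
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, noise_feat: torch.Tensor) -> torch.Tensor:
        x = self.merge(torch.cat([rgb_feat, noise_feat], dim=1))      # (B, C, H, W)
        b, c, h, w = x.shape
        pooled_h = x.mean(dim=3, keepdim=True)                        # (B, C, H, 1)
        pooled_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (B, C, W, 1)
        y = self.reduce(torch.cat([pooled_h, pooled_w], dim=2))       # (B, hidden, H+W, 1)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                         # (B, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(y_w)).permute(0, 1, 3, 2)     # (B, C, 1, W)
        return x * a_h * a_w
```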
- AICT: An Adaptive Image Compression Transformer [18.05997169440533]
We propose a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT).
The proposed ICT can capture both global and local contexts from the latent representations.
We leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract a more compact latent representation.
arXiv Detail & Related papers (2023-07-12T11:32:02Z)
- Unsupervised Structure-Consistent Image-to-Image Translation [6.282068591820945]
The Swapping Autoencoder achieved state-of-the-art performance in deep image manipulation and image-to-image translation.
We improve this work by introducing a simple yet effective auxiliary module based on gradient reversal layers.
The auxiliary module's loss forces the generator to learn to reconstruct an image with an all-zero texture code.
arXiv Detail & Related papers (2022-08-24T13:47:15Z)
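The gradient reversal layer mentioned in the entry above is a standard construct (originally from domain-adversarial training); a minimal PyTorch version is sketched below, while the surrounding auxiliary module and its loss are not reproduced here and would be assumptions.

```python
# Minimal gradient reversal layer (GRL): identity in the forward pass, negated
# (and optionally scaled) gradient in the backward pass.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Reverse (and scale) the gradient flowing back to earlier layers.
        return -ctx.scale * grad_output, None

def grad_reverse(x: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    return GradientReversal.apply(x, scale)

# Usage: features = grad_reverse(features); a loss computed on these features then
# pushes the upstream encoder/generator in the opposite direction of the auxiliary objective.
```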
- CM-GAN: Image Inpainting with Cascaded Modulation GAN and Object-Aware Training [112.96224800952724]
We propose cascaded modulation GAN (CM-GAN) to generate plausible image structures when dealing with large holes in complex images.
In each decoder block, global modulation is first applied to perform coarse, semantic-aware structure synthesis; spatial modulation is then applied to the output of global modulation to further adjust the feature map in a spatially adaptive fashion.
In addition, we design an object-aware training scheme to prevent the network from hallucinating new objects inside holes, fulfilling the needs of object removal tasks in real-world scenarios.
arXiv Detail & Related papers (2022-03-22T16:13:27Z)
- Transformer-based SAR Image Despeckling [53.99620005035804]
We introduce a transformer-based network for SAR image despeckling.
The proposed despeckling network comprises a transformer-based encoder, which allows the network to learn global dependencies between different image regions.
Experiments show that the proposed method achieves significant improvements over traditional and convolutional neural network-based despeckling methods.
arXiv Detail & Related papers (2022-01-23T20:09:01Z)
- Spatially-Adaptive Image Restoration using Distortion-Guided Networks [51.89245800461537]
We present a learning-based solution for restoring images suffering from spatially-varying degradations.
We propose SPAIR, a network design that harnesses distortion-localization information and dynamically adjusts to difficult regions in the image.
arXiv Detail & Related papers (2021-08-19T11:02:25Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
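The equivalence claimed in the entry above is easiest to see for convolutions versus fully-connected layers: a per-token linear layer is exactly a 1x1 convolution over the patch grid. The small numerical check below uses arbitrarily chosen shapes and is not taken from the paper.

```python
# Numerical check that a per-token fully-connected layer equals a 1x1 convolution
# over the patch grid (shapes chosen arbitrarily for illustration).
import torch
import torch.nn as nn

dim, patches_h, patches_w = 64, 14, 14
tokens = torch.randn(2, patches_h * patches_w, dim)           # (B, N, C) patch sequence

linear = nn.Linear(dim, dim, bias=True)
conv1x1 = nn.Conv2d(dim, dim, kernel_size=1, bias=True)
# Copy the linear weights into the 1x1 conv so both layers compute the same map.
conv1x1.weight.data.copy_(linear.weight.data.view(dim, dim, 1, 1))
conv1x1.bias.data.copy_(linear.bias.data)

out_linear = linear(tokens)                                    # (B, N, C)
grid = tokens.transpose(1, 2).reshape(2, dim, patches_h, patches_w)
out_conv = conv1x1(grid).flatten(2).transpose(1, 2)            # back to (B, N, C)

print(torch.allclose(out_linear, out_conv, atol=1e-6))         # True
```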
- Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration tasks.
We present a novel architecture with the goal of maintaining spatially precise, high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.