SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning
- URL: http://arxiv.org/abs/2507.12845v1
- Date: Thu, 17 Jul 2025 07:11:01 GMT
- Title: SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning
- Authors: Khang Truong, Lam Pham, Hieu Tang, Jasmin Lampert, Martin Boyer, Son Phan, Truong Nguyen
- Abstract summary: We present a transformer-based network architecture for remote sensing image captioning (RSIC). We evaluate our proposed models using two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption.
- Score: 2.2184293265652895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling the automated generation of descriptive text from visual content. In the context of remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us, in this paper, to present a transformer-based network architecture for remote sensing image captioning (RSIC) in which multiple techniques (Static Expansion, Memory-Augmented Self-Attention, and the Mesh Transformer) are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms state-of-the-art systems on most evaluation metrics, demonstrating its potential for real-life remote sensing systems.
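The abstract names Static Expansion, Memory-Augmented Self-Attention, and the Mesh Transformer as the techniques being integrated. As a rough illustration of the second of these, below is a minimal single-head PyTorch sketch of memory-augmented self-attention in the style popularized by the Meshed-Memory Transformer (Cornia et al., 2020), on which mesh-style captioning decoders build. The layer names, sizes, and single-head formulation are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of memory-augmented self-attention: learned memory slots are
# appended to the keys and values, so attention can draw on priors that are
# not present in the current image's features. Illustrative only.
import torch
import torch.nn as nn

class MemoryAugmentedSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_memory: int = 40):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learned memory slots (hypothetical size), shared across all inputs.
        self.mem_k = nn.Parameter(torch.randn(n_memory, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(n_memory, d_model) * 0.02)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) grid features from the image encoder.
        b = x.size(0)
        q = self.q_proj(x)
        # Concatenate memory slots after the projected keys/values.
        k = torch.cat([self.k_proj(x), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(x), self.mem_v.expand(b, -1, -1)], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # (batch, seq_len, d_model)
```

The memory slots are useful in remote sensing because scenes share recurring semantics (roads, fields, buildings) that the attention can consult even when they are weakly expressed in a particular image.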
Related papers
- MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data.
Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream change detection (CD) methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z)
- SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model [23.383837540690823]
High-resolution (HR) remote sensing imagery plays a vital role in a wide range of applications, including urban planning and environmental monitoring.
Due to limitations in sensors and data transmission links, the images acquired in practice often suffer from resolution degradation.
Remote Sensing Image Super-Resolution (RSISR) aims to reconstruct HR images from low-resolution (LR) inputs, providing a cost-effective and efficient alternative to direct HR image acquisition.
arXiv Detail & Related papers (2025-05-29T02:38:34Z)
- Vision Transformer Based Semantic Communications for Next Generation Wireless Networks [3.8095664680229935]
This paper presents a Vision Transformer (ViT)-based semantic communication framework.
By employing ViT as the encoder-decoder framework, the proposed architecture can proficiently encode images into high semantic content.
The architecture based on the proposed ViT network achieves a Peak Signal-to-Noise Ratio (PSNR) of 38 dB.
arXiv Detail & Related papers (2025-03-21T16:23:02Z)
- Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision language model (VLM) framework to enhance feature extraction, scalability, and efficiency.
We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise.
Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z)
- From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing [16.755590790629153]
This review examines the development and application of multi-modal language models (MLLMs) in remote sensing.
We focus on their ability to interpret and describe satellite imagery using natural language.
Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed.
arXiv Detail & Related papers (2024-11-05T12:14:22Z)
- Trustworthy Image Semantic Communication with GenAI: Explainability, Controllability, and Efficiency [59.15544887307901]
Image semantic communication (ISC) has garnered significant attention for its potential to achieve high efficiency in visual content transmission.
Existing ISC systems based on joint source-channel coding face challenges in interpretability, operability, and compatibility.
We propose a novel trustworthy ISC framework that employs Generative Artificial Intelligence (GenAI) for multiple downstream inference tasks.
arXiv Detail & Related papers (2024-08-07T14:32:36Z)
- Efficient Visual State Space Model for Image Deblurring [99.54894198086852]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
The proposed EVSSM performs favorably against state-of-the-art methods on benchmark datasets and real-world images.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- RS-Mamba for Large Remote Sensing Image Dense Prediction [58.12667617617306]
We propose the Remote Sensing Mamba (RSM) for dense prediction tasks in large VHR remote sensing images.
RSM is specifically designed to capture the global context of remote sensing images with linear complexity.
Our model achieves better efficiency and accuracy than transformer-based models on large remote sensing images.
arXiv Detail & Related papers (2024-04-03T12:06:01Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike the common model structure that only uses unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a crossmodal attention fusion subnetwork to extract the image and text features.
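As a hedged sketch of what such a cross-modal attention fusion subnetwork can look like (a generic formulation assumed here for illustration, not MetaSegNet's actual design), image tokens can attend over text tokens with a residual connection:

```python
# Generic cross-modal fusion: image tokens query the text tokens.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (batch, n_img, d_model), txt_tokens: (batch, n_txt, d_model)
        fused, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + fused)  # residual fusion of the two modalities
```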
arXiv Detail & Related papers (2023-12-20T03:16:34Z)
- Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning [49.48946808024608]
We propose a novel two-stage vision-language pre-training-based approach to bootstrap interactive image-text alignment for remote sensing image captioning, called BITA.
Specifically, the first stage involves preliminary alignment through image-text contrastive learning.
In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model.
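For the first-stage alignment mentioned above, a textbook CLIP-style symmetric contrastive loss looks as follows; this is an assumed generic form for illustration, not BITA's implementation:

```python
# Symmetric InfoNCE over image/text embeddings: matched pairs share a row index.
# The temperature value is a conventional assumption, not taken from the paper.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Match images to texts and texts to images, then average.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```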
arXiv Detail & Related papers (2023-12-02T17:32:17Z)
- An Empirical Study of Remote Sensing Pretraining [117.90699699469639]
We conduct an empirical study of remote sensing pretraining (RSP) on aerial images.
RSP can help deliver distinctive performances in scene recognition tasks.
RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, but it may still suffer from task discrepancies.
arXiv Detail & Related papers (2022-04-06T13:38:11Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed as ViTCAP, in which grid representations are used without extracting the regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
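A minimal sketch of such a concept-token classification head, assuming a simple mean-pooled multi-label classifier over an illustrative concept vocabulary (not the ViTCAP code):

```python
# Predict semantic concept tokens from ViT grid features via classification.
# Vocabulary size, pooling, and top-k are assumptions for illustration.
import torch
import torch.nn as nn

class ConceptTokenHead(nn.Module):
    def __init__(self, d_model: int = 768, n_concepts: int = 1000, top_k: int = 20):
        super().__init__()
        self.classifier = nn.Linear(d_model, n_concepts)  # concept vocabulary scores
        self.top_k = top_k

    def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (batch, n_patches, d_model) from a vision transformer.
        pooled = grid_feats.mean(dim=1)        # simple global pooling over patches
        logits = self.classifier(pooled)       # multi-label concept logits
        # The top-k concepts become the predicted concept tokens that a
        # captioning decoder can consume alongside the grid features.
        return logits.topk(self.top_k, dim=-1).indices  # (batch, top_k)
```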
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images [108.79667788962425]
Salient object detection in optical remote sensing images (RSI-SOD) remains a challenging emerging topic.
We propose a novel Multi-Content Complementation Network (MCCNet) to explore the complementarity of multiple content for RSI-SOD.
Within its multi-content complementation module (MCCM), we consider multiple types of features that are critical to RSI-SOD, including foreground features, edge features, background features, and global image-level features.
arXiv Detail & Related papers (2021-12-02T04:46:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.