SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning
- URL: http://arxiv.org/abs/2006.08432v2
- Date: Tue, 13 Oct 2020 10:09:15 GMT
- Title: SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning
- Authors: Gencer Sumbul, Sonali Nayak, Begüm Demir
- Abstract summary: We present a novel Summarization Driven Remote Sensing Image Captioning (SD-RSIC) approach.
The proposed approach consists of three main steps. The first step obtains the standard image captions by jointly exploiting convolutional neural networks (CNNs) with long short-term memory (LSTM) networks.
The second step summarizes the ground-truth captions of each training image into a single caption by exploiting sequence to sequence neural networks and eliminates the redundancy present in the training set.
The third step automatically defines the adaptive weights associated with each RS image to combine the standard captions with the summarized captions based on the semantic content of the image.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have recently become popular for image
captioning problems in remote sensing (RS). Existing DNN-based approaches rely
on the availability of a training set made up of a high number of RS images
with their captions. However, captions of training images may contain redundant
information (they can be repetitive or semantically similar to each other),
resulting in information deficiency while learning a mapping from the image
domain to the language domain. To overcome this limitation, in this paper, we
present a novel Summarization Driven Remote Sensing Image Captioning (SD-RSIC)
approach. The proposed approach consists of three main steps. The first step
obtains the standard image captions by jointly exploiting convolutional neural
networks (CNNs) with long short-term memory (LSTM) networks. The second step,
unlike the existing RS image captioning methods, summarizes the ground-truth
captions of each training image into a single caption by exploiting sequence to
sequence neural networks and eliminates the redundancy present in the training
set. The third step automatically defines the adaptive weights associated with
each RS image to combine the standard captions with the summarized captions
based on the semantic content of the image. This is achieved by a novel
adaptive weighting strategy defined in the context of LSTM networks.
Experimental results obtained on the RSICD, UCM-Captions and Sydney-Captions
datasets show the effectiveness of the proposed approach compared to the
state-of-the-art RS image captioning approaches. The code of the proposed
approach is publicly available at
https://gitlab.tubit.tu-berlin.de/rsim/SD-RSIC.
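To make the first step concrete (standard caption generation by jointly exploiting a CNN with an LSTM), the sketch below shows a generic CNN-encoder/LSTM-decoder captioner in PyTorch. The ResNet-18 backbone, layer sizes, and the way the image feature is prepended to the word sequence are illustrative assumptions, not the configuration reported for SD-RSIC.

```python
# Minimal CNN + LSTM captioning sketch (illustrative; not the exact SD-RSIC setup).
import torch
import torch.nn as nn
from torchvision import models


class CaptionNet(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: a backbone with its classification head removed.
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # -> (B, 512, 1, 1)
        self.img_proj = nn.Linear(512, embed_dim)
        # LSTM decoder over word embeddings, conditioned on the image feature.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)        # (B, 512)
        img_token = self.img_proj(feats).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions)                   # (B, T, E)
        seq = torch.cat([img_token, words], dim=1)     # image feature as first "token"
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                        # (B, T+1, vocab) logits


# Example forward pass with teacher-forced caption tokens (shapes only).
logits = CaptionNet(vocab_size=1000)(torch.randn(2, 3, 224, 224),
                                     torch.randint(0, 1000, (2, 12)))
```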
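The second step feeds the concatenated ground-truth captions of each training image into a sequence-to-sequence network and decodes a single summary caption, which removes the redundancy in the training set. The encoder-decoder LSTM below is a minimal sketch of that idea under teacher forcing; the actual summarization model and its training details are those described in the paper, not this snippet.

```python
# Generic seq2seq summarizer sketch: source = concatenated ground-truth captions,
# target = a single summary caption. Illustrative only.
import torch.nn as nn


class Seq2SeqSummarizer(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the concatenated captions and keep only the final (h, c) state.
        _, state = self.encoder(self.embed(src_tokens))
        # Decode the summary caption conditioned on that state (teacher forcing).
        dec_out, _ = self.decoder(self.embed(tgt_tokens), state)
        return self.out(dec_out)  # (B, T_tgt, vocab) logits
```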
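For the third step, one plausible reading of the adaptive weighting strategy is a per-image scalar weight, predicted from the image representation, that gates a convex combination of the word distributions produced by the standard-captioning branch and the summarization-driven branch. The sketch below follows that assumption only; the exact formulation in the context of LSTM networks is the one given in the paper.

```python
# Hypothetical adaptive combination of two caption branches (not the exact SD-RSIC rule).
import torch
import torch.nn as nn


class AdaptiveCombiner(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Per-image weight in (0, 1) predicted from the image feature.
        self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, img_feat, p_standard, p_summary):
        # img_feat: (B, feat_dim); p_*: (B, T, vocab) word probabilities.
        lam = self.gate(img_feat).unsqueeze(1)  # (B, 1, 1)
        return lam * p_standard + (1.0 - lam) * p_summary
```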
Related papers
- Decoding fMRI Data into Captions using Prefix Language Modeling [3.4328283704703866]
We present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model's embedding of an image from the corresponding fMRI signal.
We also explore 3D Convolutional Neural Network mapping of fMRI signals to the image embedding space to better account for the positional information of voxels.
arXiv Detail & Related papers (2025-01-05T15:06:25Z)
- Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning [70.98890307376548]
We propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate unfaithful content during training.
Our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning.
arXiv Detail & Related papers (2024-12-31T13:39:08Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
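For context, Slerp interpolates along the great circle between two embeddings instead of averaging them linearly. The snippet below is a generic sketch that assumes L2-normalized image and text vectors; it is not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F


def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between two embedding vectors."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    cos = (a * b).sum(dim=-1, keepdim=True).clamp(-1.0, 1.0)
    theta = torch.acos(cos)
    sin_theta = torch.sin(theta).clamp_min(1e-8)
    return (torch.sin((1 - t) * theta) * a + torch.sin(t * theta) * b) / sin_theta
```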
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- A Novel Triplet Sampling Method for Multi-Label Remote Sensing Image Search and Retrieval [1.123376893295777]
A common approach for learning the metric space relies on the selection of triplets of similar (positive) and dissimilar (negative) images.
We propose a novel triplet sampling method in the framework of deep neural networks (DNNs) defined for multi-label RS CBIR problems.
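As a reminder of the triplet setup mentioned above, metric learning with triplets typically minimizes a margin loss that pulls an anchor toward its positive and pushes it away from its negative. The example below uses PyTorch's standard triplet margin loss with random embeddings; it illustrates the loss only, not the paper's novel sampling method.

```python
import torch
import torch.nn as nn

# Standard triplet margin loss over anchor/positive/negative embeddings.
triplet_loss = nn.TripletMarginLoss(margin=1.0)
anchor, positive, negative = (torch.randn(8, 128) for _ in range(3))
loss = triplet_loss(anchor, positive, negative)
```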
arXiv Detail & Related papers (2021-05-08T09:16:09Z)
- A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning [32.11006090613004]
We deal with the problem of generating textual captions from optical remote sensing (RS) images using the notion of deep reinforcement learning.
We introduce an Actor Dual-Critic training strategy where a second critic model is deployed in the form of an encoder-decoder RNN.
We observe that the proposed model generates sentences on the test data that are highly similar to the ground truth, and in many critical cases it even generates better captions.
arXiv Detail & Related papers (2020-10-05T13:35:02Z)
- Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts [5.859294565508523]
A new captioning model is developed including an image encoder to extract the features, a mixture of recurrent networks to embed the set of extracted features to a set of words, and a sentence generator that combines the obtained words as a stylized sentence.
We show that the proposed captioning model can generate diverse and stylized image captions without the need for extra labeling.
arXiv Detail & Related papers (2020-07-07T11:00:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.