Large Language Models for Captioning and Retrieving Remote Sensing
  Images
- URL: http://arxiv.org/abs/2402.06475v1
- Date: Fri, 9 Feb 2024 15:31:01 GMT
- Title: Large Language Models for Captioning and Retrieving Remote Sensing
  Images
- Authors: João Daniel Silva, João Magalhães, Devis Tuia, Bruno Martins
- Abstract summary: RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
- Score: 4.499596985198142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning and cross-modal retrieval are examples of tasks that involve
the joint analysis of visual and linguistic information. In connection to
remote sensing imagery, these tasks can help non-expert users in extracting
relevant Earth observation information for a variety of applications. Still,
despite some previous efforts, the development and application of vision and
language models to the remote sensing domain have been hindered by the
relatively small size of the available datasets and models used in previous
studies. In this work, we propose RS-CapRet, a Vision and Language method for
remote sensing tasks, in particular image captioning and text-image retrieval.
We specifically propose to use a highly capable large decoder language model
together with image encoders adapted to remote sensing imagery through
contrastive language-image pre-training. To bridge the image encoder
and language decoder, we propose training simple linear layers on examples
drawn from a combination of different remote sensing image captioning datasets, keeping the
other parameters frozen. RS-CapRet can then generate descriptions for remote
sensing images and retrieve images from textual descriptions, achieving SOTA or
competitive performance with existing methods. Qualitative results illustrate
that RS-CapRet can effectively leverage the pre-trained large language model to
describe remote sensing images, retrieve them based on different types of
queries, and also show the ability to process interleaved sequences of images
and text in a dialogue manner.
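
The bridging idea in the abstract (frozen image encoder, frozen language decoder, and only simple linear layers trained in between) can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the number of prefix tokens, the random weights, and the helper names (`init_linear`, `image_to_prefix_tokens`, `rank_images`) are all hypothetical, and the cosine-similarity ranking is an assumed, generic way to do text-to-image retrieval once both modalities share an embedding space.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Create a small random linear layer (weight rows plus a zero bias)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def image_to_prefix_tokens(image_embedding, weight, bias, num_tokens, llm_dim):
    """Project one frozen-encoder image embedding into `num_tokens` vectors
    that live in the (assumed) input embedding space of the language model."""
    flat = linear(image_embedding, weight, bias)
    return [flat[i * llm_dim:(i + 1) * llm_dim] for i in range(num_tokens)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_images(query_embedding, image_embeddings):
    """Return image indices sorted from most to least similar to the query."""
    scored = [(cosine(query_embedding, emb), idx)
              for idx, emb in enumerate(image_embeddings)]
    return [idx for _, idx in sorted(scored, key=lambda s: s[0], reverse=True)]


# Hypothetical sizes, chosen only to keep the example small.
IMG_DIM, LLM_DIM, NUM_TOKENS = 8, 4, 2
rng = random.Random(0)

# Only this layer would be trained; encoder and decoder stay frozen.
weight, bias = init_linear(IMG_DIM, LLM_DIM * NUM_TOKENS, rng)

# Stand-in for the output of a frozen CLIP-style image encoder.
image_embedding = [rng.gauss(0.0, 1.0) for _ in range(IMG_DIM)]
prefix = image_to_prefix_tokens(image_embedding, weight, bias,
                                NUM_TOKENS, LLM_DIM)
print(len(prefix), "prefix tokens of size", len(prefix[0]))

# Retrieval: rank toy image embeddings against a toy query embedding.
ranking = rank_images([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])
print("ranking:", ranking)
```

In a captioning setting, the prefix vectors would be placed before the text token embeddings fed to the decoder; in a retrieval setting, the query and image embeddings would come from their own learned projections rather than the hand-written vectors used here.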
 
      
Related papers
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
 Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
 arXiv  Detail & Related papers  (2024-10-30T18:13:11Z)
- RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models [3.178739428363249]
 We propose a workflow to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions.
 arXiv  Detail & Related papers  (2024-08-27T02:45:26Z)
- Towards a multimodal framework for remote sensing image change retrieval and captioning [3.3488510654648453]
 We propose a novel foundation model for bi-temporal RS image pairs, in the context of change detection analysis.
By jointly training a contrastive encoder and captioning decoder, our model adds text-image retrieval capabilities in the context of bi-temporal change detection.
 arXiv  Detail & Related papers  (2024-06-19T10:30:56Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
 This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
 arXiv  Detail & Related papers  (2024-05-21T18:02:07Z)
- Knowledge-aware Text-Image Retrieval for Remote Sensing Images [6.4527372338977]
 Cross-modal text-image retrieval often suffers from information asymmetry between texts and images.
By mining relevant information from an external knowledge graph, we propose a Knowledge-aware Text-Image Retrieval.
We show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.
 arXiv  Detail & Related papers  (2024-05-06T11:27:27Z)
- Remote Sensing Vision-Language Foundation Models without Annotations via
  Ground Remote Alignment [61.769441954135246]
 We introduce a method to train vision-language models for remote-sensing images without using any textual annotations.
Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language.
 arXiv  Detail & Related papers  (2023-12-12T03:39:07Z)
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
 We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
 Specifically, GeoChat can answer image-level queries but also accepts region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
 arXiv  Detail & Related papers  (2023-11-24T18:59:10Z)
- Towards Automatic Satellite Images Captions Generation Using Large
  Language Models [0.5439020425819]
 We propose Automatic Remote Sensing Image Captioning (ARSIC) to automatically collect captions for remote sensing images.
We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote-sensing images.
 arXiv  Detail & Related papers  (2023-10-17T16:45:47Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
 We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
 arXiv  Detail & Related papers  (2023-05-29T17:50:33Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
 We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
 arXiv  Detail & Related papers  (2023-05-26T19:22:03Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
 We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
 Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
 arXiv  Detail & Related papers  (2022-07-26T19:35:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     