Vision-Language Models in Remote Sensing: Current Progress and Future Trends
- URL: http://arxiv.org/abs/2305.05726v2
- Date: Tue, 2 Apr 2024 09:52:41 GMT
- Authors: Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, Xiao Xiang Zhu
- Abstract summary: Vision-language models enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics.
Vision-language models can go beyond visual recognition of RS images, model semantic relationships, and generate natural language descriptions of images.
This paper provides a comprehensive review of the research on vision-language models in remote sensing.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in the field of large language models for Artificial General Intelligence (AGI). These models provide intelligent solutions close to human thinking, enabling us to use general artificial intelligence to solve problems in various applications. However, in remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in remote sensing primarily focuses on visual understanding tasks while neglecting the semantic understanding of objects and their relationships. This is where vision-language models excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond visual recognition of RS images, model semantic relationships, and generate natural language descriptions of images. This makes them better suited for tasks requiring both visual and textual understanding, such as image captioning and visual question answering. This paper provides a comprehensive review of the research on vision-language models in remote sensing, summarizing the latest progress, highlighting challenges, and identifying potential research opportunities.
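As an illustration of the image-text reasoning the abstract describes, the following is a minimal sketch of how a CLIP-style vision-language model ranks candidate captions for an RS image at inference time: each modality is encoded into a shared embedding space, and captions are scored by cosine similarity to the image embedding. The embeddings here are random placeholders (the "aligned" caption is faked by perturbing the image embedding), not outputs of any real encoder; this is not the survey's method, just a generic sketch of contrastive image-text matching.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_captions(image_emb: np.ndarray, caption_embs: dict) -> list:
    """Rank candidate captions by embedding similarity to the image,
    as a CLIP-style contrastive model does at inference time."""
    scores = {cap: cosine_similarity(image_emb, emb)
              for cap, emb in caption_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Placeholder embeddings: in a real system these would come from image
# and text encoders trained to align the two modalities.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=64)
caption_embs = {
    # "Aligned" caption, faked by slightly perturbing the image embedding.
    "an airport with several runways": image_emb + 0.1 * rng.normal(size=64),
    # Unrelated caption: an independent random vector, near-zero similarity.
    "dense residential area with a river": rng.normal(size=64),
}
print(rank_captions(image_emb, caption_embs)[0])
# → an airport with several runways
```

The same scoring mechanism underlies zero-shot classification: class names are phrased as captions ("a satellite photo of a {class}") and the highest-similarity caption wins.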
Related papers
- ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers [1.6541870997607049]
We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers.
ARPA's introduction marks a significant milestone in visual word disambiguation, offering a compelling solution.
We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive unprecedented advancements in artificial intelligence.
arXiv Detail & Related papers (2024-08-12T10:15:13Z) - Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [37.44286562901589]
We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning.
We conduct a comprehensive evaluation of competitive language and vision-language models.
Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
arXiv Detail & Related papers (2024-06-21T03:53:37Z) - Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images.
Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, the OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z) - Trends, Applications, and Challenges in Human Attention Modelling [65.61554471033844]
Human attention modelling has proven to be particularly useful for understanding the cognitive processes underlying visual exploration.
It provides support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling.
arXiv Detail & Related papers (2024-02-28T19:35:30Z) - Vision-Language Navigation with Embodied Intelligence: A Survey [19.049590467248255]
Vision-language navigation (VLN) is a critical research path to achieve embodied intelligence.
VLN integrates artificial intelligence, natural language processing, computer vision, and robotics.
This survey systematically reviews the research progress of VLN and details the research direction of VLN with embodied intelligence.
arXiv Detail & Related papers (2024-02-22T05:45:17Z) - Analyzing the Roles of Language and Vision in Learning from Limited Data [31.895396236504993]
We study the contributions that language and vision make to learning about the world.
We find that a language model leveraging all components recovers a majority of a Vision-Language Model's performance.
arXiv Detail & Related papers (2024-02-15T22:19:41Z) - Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z) - DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z) - Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z) - Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z) - From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet converged on a definitive approach.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.