Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
- URL: http://arxiv.org/abs/2410.17283v2
- Date: Thu, 14 Nov 2024 11:42:52 GMT
- Title: Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
- Authors: Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Kelu Yao, Chao Li, Xizhe Xue,
- Abstract summary: We review the fundamental theories related to visual language models (VLMs) and the datasets constructed for them in remote sensing.
We categorize the improvement methods into three main parts according to the core components ofVLMs and provide a detailed introduction and comparison of these methods.
- Score: 9.248637518957445
- License:
- Abstract: Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in visual language models (VLMs) have pushed this enthusiasm to new heights. Differring from previous AI approaches that generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling the handling of more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLM, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they addressed. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods. A project associated with this review has been created at https://github.com/taolijie11111/VLMs-in-RS-review.
Related papers
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z) - From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models [56.9134620424985]
Cross-modal reasoning (CMR) is increasingly recognized as a crucial capability in the progression toward more sophisticated artificial intelligence systems.
The recent trend of deploying Large Language Models (LLMs) to tackle CMR tasks has marked a new mainstream of approaches for enhancing their effectiveness.
This survey offers a nuanced exposition of current methodologies applied in CMR using LLMs, classifying these into a detailed three-tiered taxonomy.
arXiv Detail & Related papers (2024-09-19T02:51:54Z) - Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z) - A Review of Multi-Modal Large Language and Vision Models [1.9685736810241874]
Large Language Models (LLMs) have emerged as a focal point of research and application.
Recently, LLMs have been extended into multi-modal large language models (MM-LLMs)
This paper provides an extensive review of the current state of those LLMs with multi-modal capabilities as well as the very recent MM-LLMs.
arXiv Detail & Related papers (2024-03-28T15:53:45Z) - Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs)
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z) - RSGPT: A Remote Sensing Vision Language Model and Benchmark [7.279747655485913]
We build a high-quality Remote Sensing Image Captioning dataset (RSICap)
This dataset comprises 2,585 human-annotated captions with rich and high-quality information.
We also provide a benchmark evaluation dataset called RSIEval.
arXiv Detail & Related papers (2023-07-28T02:23:35Z) - Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z) - Learning without Forgetting for Vision-Language Models [65.49600786387106]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z) - Vision-Language Models for Vision Tasks: A Survey [62.543250338410836]
Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs.
This paper provides a systematic review of visual language models for various visual recognition tasks.
arXiv Detail & Related papers (2023-04-03T02:17:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.