Towards Vision-Language Geo-Foundation Model: A Survey
- URL: http://arxiv.org/abs/2406.09385v1
- Date: Thu, 13 Jun 2024 17:57:30 GMT
- Title: Towards Vision-Language Geo-Foundation Model: A Survey
- Authors: Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, Wayne Zhang
- Abstract summary: Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
- Score: 65.70547895998541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation tasks. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications across various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We continuously track related works at https://github.com/zytx121/Awesome-VLGFM.
Related papers
- Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - Position: Graph Foundation Models are Already Here [53.737868336014735]
Graph Foundation Models (GFMs) are emerging as a significant research topic in the graph domain.
We propose a novel perspective for GFM development by advocating for a "graph vocabulary".
This perspective can potentially advance future GFM design in line with neural scaling laws.
arXiv Detail & Related papers (2024-02-03T17:24:36Z) - On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications [38.416917485939486]
This paper explores the capabilities of GPT-4V in the realms of geography, environmental science, agriculture, and urban planning.
Data sources include satellite imagery, aerial photos, ground-level images, field images, and public datasets.
The model is evaluated on a series of tasks including geo-localization, textual data extraction from maps, remote sensing image classification, visual question answering, crop type identification, disease/pest/weed recognition, chicken behavior analysis, agricultural object counting, urban planning knowledge question answering, and plan generation.
arXiv Detail & Related papers (2023-12-23T22:36:58Z) - General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z) - ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z) - OV-VG: A Benchmark for Open-Vocabulary Visual Grounding [33.02137080950678]
This work introduces novel and challenging open-vocabulary visual tasks.
The overarching aim is to establish connections between language descriptions and the localization of novel objects.
We have curated a benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images.
arXiv Detail & Related papers (2023-10-22T17:54:53Z) - Towards Graph Foundation Models: A Survey and Beyond [66.37994863159861]
Foundation models have emerged as critical components in a variety of artificial intelligence applications.
The capabilities of foundation models to generalize and adapt motivate graph machine learning researchers to discuss the potential of developing a new graph learning paradigm.
This article introduces the concept of Graph Foundation Models (GFMs), and offers an exhaustive explanation of their key characteristics and underlying technologies.
arXiv Detail & Related papers (2023-10-18T09:31:21Z) - City Foundation Models for Learning General Purpose Representations from OpenStreetMap [17.577683270277173]
We present CityFM, a framework to train a foundation model within a selected geographical area of interest, such as a city.
CityFM relies solely on open data from OpenStreetMap and produces multimodal representations of entities of different types, incorporating spatial, visual, and textual information.
In all the experiments, CityFM achieves performance superior to, or on par with, the baselines.
arXiv Detail & Related papers (2023-10-01T05:55:30Z) - On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence [39.86997089245117]
Foundation models (FMs) can be adapted to a wide range of downstream tasks via fine-tuning, few-shot learning, or zero-shot learning.
We propose that one of the major challenges in developing an FM for GeoAI is addressing the multimodal nature of geospatial tasks.
arXiv Detail & Related papers (2023-04-13T19:50:17Z) - A General Purpose Neural Architecture for Geospatial Systems [142.43454584836812]
We present a roadmap towards the construction of a general-purpose neural architecture (GPNA) with a geospatial inductive bias.
We envision how such a model may facilitate cooperation between members of the community.
arXiv Detail & Related papers (2022-11-04T09:58:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.