K2: A Foundation Language Model for Geoscience Knowledge Understanding
and Utilization
- URL: http://arxiv.org/abs/2306.05064v2
- Date: Wed, 13 Sep 2023 19:33:18 GMT
- Title: K2: A Foundation Language Model for Geoscience Knowledge Understanding
and Utilization
- Authors: Cheng Deng, Tianhang Zhang, Zhongmou He, Yi Xu, Qiyuan Chen, Yuanyuan
Shi, Luoyi Fu, Weinan Zhang, Xinbing Wang, Chenghu Zhou, Zhouhan Lin, Junxian
He
- Abstract summary: Large language models (LLMs) have achieved great success in general domains of natural language processing.
We present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience.
- Score: 105.89544876731942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved great success in general domains
of natural language processing. In this paper, we bring LLMs to the realm of
geoscience with the objective of advancing research and applications in this
field. To this end, we present the first-ever LLM in geoscience, K2, alongside
a suite of resources developed to further promote LLM research within
geoscience. For instance, we have curated the first geoscience instruction
tuning dataset, GeoSignal, which aims to align LLM responses to
geoscience-related user queries. Additionally, we have established the first
geoscience benchmark, GeoBench, to evaluate LLMs in the context of geoscience.
In this work, we experiment with a complete recipe to adapt a pre-trained
general-domain LLM to the geoscience domain. Specifically, we further train the
LLaMA-7B model on 5.5B tokens of geoscience text corpus, including over 1
million pieces of geoscience literature, and utilize GeoSignal's supervised
data to fine-tune the model. Moreover, we share a protocol that can efficiently
gather domain-specific data and construct domain-supervised data, even in
situations where manpower is scarce. Meanwhile, we equip K2 with the ability
to use tools, making it a basic geoscience aide. Experiments conducted on
GeoBench demonstrate the effectiveness of our approach and datasets for
geoscience knowledge understanding and utilization. We open-source all the
training data and K2 model checkpoints at https://github.com/davendw49/k2.
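The abstract outlines a two-stage recipe: continued pre-training of LLaMA-7B on a geoscience corpus, followed by supervised fine-tuning on GeoSignal. As a rough illustration only, the sketch below shows what such a pipeline can look like with Hugging Face Transformers; the hub model id, file names, and hyperparameters are assumptions for illustration, not the authors' setup (their actual code is in the linked repository).

```python
# Minimal sketch of a K2-style two-stage adaptation pipeline.
# All paths, the hub model id, and hyperparameters are illustrative
# placeholders; the authors' code is at https://github.com/davendw49/k2.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "huggyllama/llama-7b"  # assumed hub id for LLaMA-7B weights
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token
model = AutoModelForCausalLM.from_pretrained(BASE)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Stage 1: continued pre-training on geoscience text (the paper
# reports ~5.5B tokens drawn from over 1M pieces of literature).
corpus = load_dataset("text", data_files="geoscience_corpus.txt")["train"]
corpus = corpus.map(tokenize, batched=True, remove_columns=["text"])
Trainer(
    model=model,
    args=TrainingArguments(output_dir="k2-pretrain", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=collator,
).train()

# Stage 2: supervised fine-tuning on GeoSignal-style instruction data;
# each JSON line is assumed to hold {"instruction": ..., "response": ...}.
sft = load_dataset("json", data_files="geosignal.jsonl")["train"]
sft = sft.map(
    lambda ex: {"text": f"{ex['instruction']}\n{ex['response']}"},
    remove_columns=["instruction", "response"],
)
sft = sft.map(tokenize, batched=True, remove_columns=["text"])
Trainer(
    model=model,
    args=TrainingArguments(output_dir="k2-sft", num_train_epochs=3),
    train_dataset=sft,
    data_collator=collator,
).train()
```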
Related papers
- Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework [59.42946541163632]
We introduce a comprehensive geolocation framework with three key components:
GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric.
We demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
arXiv Detail & Related papers (2025-02-19T14:21:25Z)
- PEACE: Empowering Geologic Map Holistic Understanding with MLLMs [64.58959634712215]
Geologic maps, as fundamental diagrams in geoscience, provide critical insights into the structure and composition of Earth's subsurface and surface.
Despite their significance, current Multimodal Large Language Models (MLLMs) often fall short in geologic map understanding.
To quantify this gap, we construct GeoMap-Bench, the first-ever benchmark for evaluating MLLMs in geologic map understanding.
arXiv Detail & Related papers (2025-01-10T18:59:42Z)
- GeoGalactica: A Scientific Large Language Model in Geoscience [95.15911521220052]
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP).
We specialize an LLM for geoscience by further pre-training the model on a vast amount of geoscience texts and then applying supervised fine-tuning (SFT) with our custom-collected instruction-tuning dataset.
We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, making it the largest geoscience-specific text corpus.
Then we fine-tune the model with 1 million pairs of instruction-tuning data.
arXiv Detail & Related papers (2023-12-31T09:22:54Z)
- GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding [45.36562604939258]
This paper introduces GeoLM, a language model that enhances the understanding of geo-entities in natural language.
We demonstrate that GeoLM exhibits promising capabilities in supporting toponym recognition, toponym linking, relation extraction, and geo-entity typing.
arXiv Detail & Related papers (2023-10-23T01:20:01Z)
- GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223]
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
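The summary above describes eliciting quantitative geospatial estimates (e.g., population density) from an LLM by prompting. The fragment below is a generic, hypothetical illustration of that idea, not GeoLLM's actual prompt format; `query_llm` is a stand-in for any chat-completion API.

```python
import re

def query_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (hypothetical)."""
    raise NotImplementedError

def estimate_population_density(place: str, context: str) -> float:
    """Prompt for a numeric estimate and parse the first number found.

    `context` would carry auxiliary location information (coordinates,
    nearby place names); GeoLLM-style methods build it from map data.
    """
    prompt = (
        f"Location: {place}\n"
        f"Context: {context}\n"
        "Estimate the population density in people per square km. "
        "Answer with a single number."
    )
    reply = query_llm(prompt)
    match = re.search(r"\d[\d,]*\.?\d*", reply)
    if match is None:
        raise ValueError(f"no numeric estimate in reply: {reply!r}")
    return float(match.group().replace(",", ""))
```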
arXiv Detail & Related papers (2023-10-10T00:03:23Z)
- Are Large Language Models Geospatially Knowledgeable? [21.401931052512595]
This paper investigates the extent of geospatial knowledge, awareness, and reasoning abilities encoded within Large Language Models (LLMs).
With a focus on autoregressive language models, we devise experimental approaches related to (i) probing LLMs for geo-coordinates to assess geospatial knowledge, (ii) using geospatial and non-geospatial prepositions to gauge their geospatial awareness, and (iii) utilizing a multidimensional scaling (MDS) experiment to assess the models' geospatial reasoning capabilities.
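As a rough sketch of the MDS-based reasoning probe described above (not the authors' exact protocol): elicit pairwise distances between places from the model, embed them with metric MDS, and compare the recovered layout against true coordinates. The distance matrix below is hard-coded dummy data standing in for parsed LLM outputs.

```python
import numpy as np
from sklearn.manifold import MDS

cities = ["London", "Paris", "Berlin", "Madrid"]

# Hypothetical pairwise distances (km) as an LLM might report them;
# in the actual probe these would be parsed from model responses.
llm_distances = np.array([
    [0,    340,  930,  1260],
    [340,  0,    880,  1050],
    [930,  880,  0,    1870],
    [1260, 1050, 1870, 0],
], dtype=float)

# Metric MDS recovers a 2-D configuration whose pairwise distances
# approximate the reported ones (up to rotation/reflection).
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(llm_distances)

for city, (x, y) in zip(cities, coords):
    print(f"{city:7s} -> ({x:8.1f}, {y:8.1f})")

# The recovered layout can then be compared (after Procrustes
# alignment) to true geographic coordinates to score reasoning.
```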
arXiv Detail & Related papers (2023-10-09T17:20:11Z)
- GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z)
- Geographic Adaptation of Pretrained Language Models [29.81557992080902]
We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup.
We show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the pretrained language models.
arXiv Detail & Related papers (2022-03-16T11:55:00Z)
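The geoadaptation entry above describes coupling language modeling with geolocation prediction as a multi-task objective. The sketch below shows one generic way to express such a joint loss in PyTorch; the base checkpoint, the regression-style geolocation head, and `lambda_geo` are assumptions for illustration, not the paper's actual formulation.

```python
import torch.nn as nn
from transformers import AutoModelForMaskedLM

class GeoAdaptationModel(nn.Module):
    """Multi-task wrapper: MLM loss + geolocation regression loss.

    `lambda_geo` weights the auxiliary task; a 2-D linear head predicts
    (latitude, longitude) from mean-pooled encoder states. All names
    and hyperparameters here are illustrative assumptions.
    """
    def __init__(self, base: str = "bert-base-multilingual-cased",
                 lambda_geo: float = 0.5):
        super().__init__()
        self.lm = AutoModelForMaskedLM.from_pretrained(
            base, output_hidden_states=True)
        self.geo_head = nn.Linear(self.lm.config.hidden_size, 2)
        self.lambda_geo = lambda_geo

    def forward(self, input_ids, attention_mask, labels, coords):
        out = self.lm(input_ids=input_ids, attention_mask=attention_mask,
                      labels=labels)
        pooled = out.hidden_states[-1].mean(dim=1)
        geo_loss = nn.functional.mse_loss(self.geo_head(pooled), coords)
        # Joint objective: language modeling plus weighted geolocation.
        return out.loss + self.lambda_geo * geo_loss
```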
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.