K2: A Foundation Language Model for Geoscience Knowledge Understanding
and Utilization
- URL: http://arxiv.org/abs/2306.05064v2
- Date: Wed, 13 Sep 2023 19:33:18 GMT
- Title: K2: A Foundation Language Model for Geoscience Knowledge Understanding
and Utilization
- Authors: Cheng Deng, Tianhang Zhang, Zhongmou He, Yi Xu, Qiyuan Chen, Yuanyuan
Shi, Luoyi Fu, Weinan Zhang, Xinbing Wang, Chenghu Zhou, Zhouhan Lin, Junxian
He
- Abstract summary: Large language models (LLMs) have achieved great success in general domains of natural language processing.
We present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience.
- Score: 105.89544876731942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have achieved great success in general domains
of natural language processing. In this paper, we bring LLMs to the realm of
geoscience with the objective of advancing research and applications in this
field. To this end, we present the first-ever LLM in geoscience, K2, alongside
a suite of resources developed to further promote LLM research within
geoscience. For instance, we have curated the first geoscience instruction
tuning dataset, GeoSignal, which aims to align LLM responses to
geoscience-related user queries. Additionally, we have established the first
geoscience benchmark, GeoBench, to evaluate LLMs in the context of geoscience.
In this work, we experiment with a complete recipe to adapt a pre-trained
general-domain LLM to the geoscience domain. Specifically, we further train the
LLaMA-7B model on 5.5B tokens of geoscience text corpus, including over 1
million pieces of geoscience literature, and utilize GeoSignal's supervised
data to fine-tune the model. Moreover, we share a protocol that can efficiently
gather domain-specific data and construct domain-supervised data, even in
situations where manpower is scarce. Meanwhile, we equip K2 with the ability to use tools so that it can serve as a basic geoscience aide. Experiments conducted on the
GeoBench demonstrate the effectiveness of our approach and datasets on
geoscience knowledge understanding and utilization. We open-source all of the training data and K2 model checkpoints at https://github.com/davendw49/k2.
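For a concrete picture of the adaptation recipe described above, the following is a minimal sketch of the continued pre-training stage using HuggingFace Transformers; the checkpoint name, corpus file, and hyperparameters are illustrative assumptions, not the authors' actual training setup.

```python
# Minimal sketch of continued pre-training (domain adaptation) with
# HuggingFace Transformers. Checkpoint, file names, and hyperparameters
# are assumptions, not the K2 authors' configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "huggyllama/llama-7b"   # stand-in for the LLaMA-7B weights
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token        # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# geoscience_corpus.txt is a hypothetical file: one plain-text document per line.
ds = load_dataset("text", data_files="geoscience_corpus.txt")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=2048),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="k2-continued-pretrain",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=ds,
    # mlm=False -> standard causal (next-token) language-modeling labels
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()  # instruction tuning on GeoSignal-style data would follow
```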
Related papers
- GeoGalactica: A Scientific Large Language Model in Geoscience [95.15911521220052] (arXiv 2023-12-31)
Large language models (LLMs) have achieved great success owing to their general knowledge and their ability to solve a wide spectrum of tasks in natural language processing (NLP).
We specialize an LLM into geoscience by further pre-training the model with a vast amount of geoscience text, and by supervised fine-tuning (SFT) of the resulting model on our custom-collected instruction-tuning dataset.
We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, to date the largest geoscience-specific text corpus.
Then we fine-tune the model with 1 million pairs of instruction-tuning data.
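To illustrate what the SFT step consumes, here is a minimal data-preparation sketch; the prompt template and the example pair are hypothetical conventions, not the GeoGalactica format.

```python
# Minimal SFT data-preparation sketch: flatten (instruction, response)
# pairs into single training strings. The template is a common community
# convention, not necessarily the one used for GeoGalactica or GeoSignal.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

pairs = [  # hypothetical geoscience instruction-tuning examples
    {"instruction": "Define 'lithosphere' in one sentence.",
     "response": "The lithosphere is the rigid outer shell of the Earth, "
                 "comprising the crust and the uppermost mantle."},
]

train_texts = [TEMPLATE.format(**p) for p in pairs]
print(train_texts[0])  # these strings feed the same causal-LM trainer as above
```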
- GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding [45.36562604939258] (arXiv 2023-10-23)
This paper introduces GeoLM, a language model that enhances the understanding of geo-entities in natural language.
We demonstrate that GeoLM exhibits promising capabilities in supporting toponym recognition, toponym linking, relation extraction, and geo-entity typing.
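Toponym recognition is essentially token classification; as a rough illustration, here is a sketch using a generic off-the-shelf NER model as a stand-in, not GeoLM itself.

```python
# Toponym recognition sketch via token classification. The model name is
# a generic public NER checkpoint, used here only as a stand-in for GeoLM.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for ent in ner("The Rhine flows from the Swiss Alps through Basel to Rotterdam."):
    if ent["entity_group"] == "LOC":  # keep only location mentions (toponyms)
        print(ent["word"], round(ent["score"], 3))
```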
- GeoLLM: Extracting Geospatial Knowledge from Large Language Models [49.20315582673223] (arXiv 2023-10-10)
We present GeoLLM, a novel method that can effectively extract geospatial knowledge from large language models.
We demonstrate the utility of our approach across multiple tasks of central interest to the international community, including the measurement of population density and economic livelihoods.
Our experiments reveal that LLMs are remarkably sample-efficient, rich in geospatial information, and robust across the globe.
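The core idea can be sketched as prompt construction: describe a location's spatial context in text and ask an LLM for a numeric estimate. The function, field layout, and example places below are hypothetical, not the paper's exact prompt format.

```python
# Hypothetical GeoLLM-style prompting sketch: encode a location and its
# nearby places into a prompt and ask an LLM for a numeric estimate.
def build_geo_prompt(lat: float, lon: float, nearby_places: list[str]) -> str:
    """Assemble a prompt describing the location's spatial context."""
    places = ", ".join(nearby_places)
    return (
        f"Coordinates: ({lat:.4f}, {lon:.4f})\n"
        f"Nearby places: {places}\n"
        "On a scale of 0 to 9, estimate the population density here. "
        "Answer with a single digit."
    )

prompt = build_geo_prompt(9.0765, 7.3986, ["Abuja", "Wuse Market", "Jabi Lake"])
print(prompt)  # send to any instruction-tuned LLM and parse the digit it returns
```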
- Are Large Language Models Geospatially Knowledgeable? [21.401931052512595] (arXiv 2023-10-09)
This paper investigates the extent of geospatial knowledge, awareness, and reasoning abilities encoded within large language models (LLMs).
With a focus on autoregressive language models, we devise experimental approaches related to (i) probing LLMs for geo-coordinates to assess geospatial knowledge, (ii) using geospatial and non-geospatial prepositions to gauge their geospatial awareness, and (iii) utilizing a multidimensional scaling (MDS) experiment to assess the models' geospatial reasoning capabilities.
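A minimal sketch of the MDS idea (my own reconstruction, not the paper's code): take coordinates an LLM predicted for a set of cities, build a pairwise great-circle distance matrix, and embed it in 2-D to see whether the model's implicit "map" resembles real geography. The predicted coordinates below are hypothetical.

```python
# MDS probe sketch: embed an LLM's implied inter-city distances in 2-D.
import numpy as np
from sklearn.manifold import MDS

def haversine(p, q, r=6371.0):
    """Great-circle distance in km between (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (*p, *q))
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical LLM-predicted (lat, lon) for a few cities.
predicted = {"London": (51.5, -0.1), "Paris": (48.9, 2.4),
             "Berlin": (52.5, 13.4), "Madrid": (40.4, -3.7)}
names = list(predicted)
D = np.array([[haversine(predicted[a], predicted[b]) for b in names]
              for a in names])

coords_2d = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)
for name, xy in zip(names, coords_2d):
    print(name, xy.round(1))
```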
- Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking [61.60169764507917] (arXiv 2023-09-04)
Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates.
We propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines.
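For intuition, a generic bi-encoder re-ranking sketch is shown below; Geo-Encoder's actual chunk-argument design is more involved, and the model name is an assumption.

```python
# Generic bi-encoder re-ranking sketch (not Geo-Encoder's architecture):
# embed the query address and candidates, then sort by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "浙江省杭州市西湖区文三路90号"           # query address
candidates = ["杭州市西湖区文三路90号东方通信大厦",
              "杭州市拱墅区文二路90号",
              "宁波市海曙区文三路90号"]

q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)[0]           # cosine similarity per candidate

# Re-rank candidates by similarity to the query address.
for score, cand in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.3f}  {cand}")
```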
- GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741] (arXiv 2023-05-11)
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
- Geographic Adaptation of Pretrained Language Models [29.81557992080902] (arXiv 2022-03-16)
We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup.
We show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the pretrained language models.
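A condensed sketch of the multi-task coupling (my reconstruction, not the authors' code): share one encoder between masked language modeling and geolocation regression, and optimize the summed loss. The checkpoint, example text, and coordinates are illustrative.

```python
# Multi-task geoadaptation sketch: joint MLM + geolocation regression loss.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
geo_head = torch.nn.Linear(mlm.config.hidden_size, 2)  # predicts (lat, lon)

# Toy example: a text paired with the (lat, lon) where it was written.
batch = tok(["Grüezi mitenand, wie gaht's?"], return_tensors="pt")
labels = batch["input_ids"].clone()  # toy labels; real MLM masks ~15% of tokens

out = mlm(**batch, labels=labels, output_hidden_states=True)
cls = out.hidden_states[-1][:, 0]                  # [CLS] representation
target = torch.tensor([[47.37, 8.54]])             # gold geolocation (Zurich)
geo_loss = torch.nn.functional.mse_loss(geo_head(cls), target)

loss = out.loss + geo_loss                         # joint multi-task objective
loss.backward()
```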
This list is automatically generated from the titles and abstracts of the papers on this site.