Classification of Geological Borehole Descriptions Using a Domain Adapted Large Language Model
- URL: http://arxiv.org/abs/2407.10991v1
- Date: Mon, 24 Jun 2024 07:29:43 GMT
- Title: Classification of Geological Borehole Descriptions Using a Domain Adapted Large Language Model
- Authors: Hossein Ghorbanfekr, Pieter Jan Kerstens, Katrijn Dirix
- Abstract summary: GEOBERTje is a domain adapted large language model trained on geological borehole descriptions from Flanders (Belgium) in the Dutch language.
We show that our classifier outperforms both a rule-based approach and GPT-4 of OpenAI.
This study exemplifies how domain adapted large language models enhance the efficiency and accuracy of extracting information from complex, unstructured geological descriptions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Geological borehole descriptions contain detailed textual information about the composition of the subsurface. However, their unstructured format presents significant challenges for extracting relevant features into a structured format. This paper introduces GEOBERTje: a domain adapted large language model trained on geological borehole descriptions from Flanders (Belgium) in the Dutch language. This model effectively extracts relevant information from the borehole descriptions and represents it in a numeric vector space. Showcasing just one potential application of GEOBERTje, we finetune a classifier model on a limited number of manually labeled observations. This classifier categorizes borehole descriptions into main, secondary, and tertiary lithology classes. We show that our classifier outperforms both a rule-based approach and OpenAI's GPT-4. This study exemplifies how domain adapted large language models enhance the efficiency and accuracy of extracting information from complex, unstructured geological descriptions. This offers new opportunities for geological analysis and modeling using vast amounts of data.
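The pipeline the abstract describes (embed each description with a domain-adapted model, then train a lightweight classifier on a small labeled set) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 768-dimensional embeddings are random placeholders standing in for GEOBERTje output, the lithology names are illustrative, and the classifier is a plain multinomial logistic regression trained by gradient descent.

```python
import numpy as np

# Hypothetical sketch: classify borehole-description embeddings into
# lithology classes. The embeddings below are random stand-ins for
# GEOBERTje [CLS] vectors (768-dim, the usual BERT hidden size); the
# class names are illustrative, not the paper's actual label set.
rng = np.random.default_rng(0)
CLASSES = ["sand", "clay", "loam"]   # illustrative lithologies
DIM = 768

# Mock "limited number of manually labeled observations": 20 embeddings
# per class, each class centred on a different random direction so the
# toy task is learnable.
centres = rng.normal(size=(len(CLASSES), DIM))
X = np.vstack([c + 0.3 * rng.normal(size=(20, DIM)) for c in centres])
y = np.repeat(np.arange(len(CLASSES)), 20)

# Multinomial logistic regression fitted with batch gradient descent.
W = np.zeros((DIM, len(CLASSES)))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)              # softmax probabilities
    grad = X.T @ (p - np.eye(len(CLASSES))[y]) / len(y)
    W -= 0.5 * grad

accuracy = (np.argmax(X @ W, axis=1) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

In the paper's setting, the random placeholder vectors would be replaced by actual GEOBERTje embeddings of Dutch borehole descriptions, and one such classifier would be fitted per output slot (main, secondary, tertiary lithology).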
Related papers
- Explaining Datasets in Words: Statistical Models with Natural Language Parameters [66.69456696878842]
We introduce a family of statistical models -- including clustering, time series, and classification models -- parameterized by natural language predicates.
We apply our framework to a wide range of problems: taxonomizing user chat dialogues, characterizing how they evolve across time, and finding categories where one language model outperforms another.
arXiv Detail & Related papers (2024-09-13T01:40:20Z)
- Hidden Holes: topological aspects of language models [1.1172147007388977]
We study the evolution of topological structure in GPT based large language models across depth and time during training.
We show that the latter exhibit more topological complexity, with a distinct pattern of changes common to all natural languages but absent from synthetically generated data.
arXiv Detail & Related papers (2024-06-09T14:25:09Z)
- Node-Level Topological Representation Learning on Point Clouds [5.079602839359521]
We propose a novel method to extract node-level topological features from complex point clouds.
We verify the effectiveness of these topological point features on both synthetic and real-world data.
arXiv Detail & Related papers (2024-06-04T13:29:12Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts)
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties [53.177550970052174]
ProLab is a novel approach using property-level label space for creating strong interpretable segmentation models.
It uses descriptive properties grounded in common sense knowledge for supervising segmentation models.
arXiv Detail & Related papers (2023-12-21T11:43:41Z)
- GPT Struct Me: Probing GPT Models on Narrative Entity Extraction [2.049592435988883]
We evaluate the capabilities of two state-of-the-art language models -- GPT-3 and GPT-3.5 -- in the extraction of narrative entities.
This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles.
arXiv Detail & Related papers (2023-11-24T16:19:04Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- Topologically Regularized Data Embeddings [22.222311627054875]
We introduce a new set of topological losses, and propose their usage as a way for topologically regularizing data embeddings to naturally represent a prespecified model.
We include experiments on synthetic and real data that highlight the usefulness and versatility of this approach.
arXiv Detail & Related papers (2021-10-18T11:25:47Z)
- ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge can exceed what is needed, since the output description may cover only the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)
- Topological Data Analysis in Text Classification: Extracting Features with Additive Information [2.1410799064827226]
Topological Data Analysis is challenging to apply to high dimensional numeric data.
Topological features carry some exclusive information not captured by conventional text mining methods.
Adding topological features to the conventional features in ensemble models improves the classification results.
arXiv Detail & Related papers (2020-03-29T21:02:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.