Extracting Accurate Materials Data from Research Papers with
Conversational Language Models and Prompt Engineering
- URL: http://arxiv.org/abs/2303.05352v3
- Date: Wed, 21 Feb 2024 12:07:30 GMT
- Title: Extracting Accurate Materials Data from Research Papers with
Conversational Language Models and Prompt Engineering
- Authors: Maciej P. Polak, Dane Morgan
- Abstract summary: ChatExtract can fully automate very accurate data extraction with minimal initial effort and background.
In tests on materials data we find precision and recall both close to 90% from the best conversational LLMs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a growing effort to replace manual extraction of data from
research papers with automated data extraction based on natural language
processing, language models, and recently, large language models (LLMs).
Although these methods enable efficient extraction of data from large sets of
research papers, they require a significant amount of up-front effort,
expertise, and coding. In this work we propose the ChatExtract method that can
fully automate very accurate data extraction with minimal initial effort and
background, using an advanced conversational LLM. ChatExtract consists of a set
of engineered prompts applied to a conversational LLM that both identify
sentences with data, extract that data, and assure the data's correctness
through a series of follow-up questions. These follow-up questions largely
overcome known issues with LLMs providing factually inaccurate responses.
ChatExtract can be applied with any conversational LLMs and yields very high
quality data extraction. In tests on materials data we find precision and
recall both close to 90% from the best conversational LLMs, like ChatGPT-4. We
demonstrate that the exceptional performance is enabled by the information
retention in a conversational model combined with purposeful redundancy and
introducing uncertainty through follow-up prompts. These results suggest that
approaches similar to ChatExtract, due to their simplicity, transferability,
and accuracy are likely to become powerful tools for data extraction in the
near future. Finally, databases for critical cooling rates of metallic glasses
and yield strengths of high entropy alloys are developed using ChatExtract.
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - Leveraging Large Language Models for Web Scraping [0.0]
This research investigates a general-purpose accurate data scraping recipe for RAG models designed for language generation.
To capture knowledge in a more modular and interpretable way, we use pre trained language models with a latent knowledge retriever.
arXiv Detail & Related papers (2024-06-12T14:15:15Z) - DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS)
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z) - Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries [48.243879779374836]
Few-shot dialogue state tracking (DST) with Large Language Models (LLM) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning.
Previous works use raw dialogue context as search keys and queries, and a retriever is fine-tuned with annotated dialogues to achieve superior performance.
We handle the task of conversation retrieval based on text summaries of the conversations.
A LLM-based conversation summarizer is adopted for query and key generation, which enables effective maximum inner product search.
arXiv Detail & Related papers (2024-02-20T14:31:17Z) - SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z) - Zero-shot information extraction from radiological reports using ChatGPT [19.457604666012767]
Information extraction is the strategy to transform the sequence of characters into structured data.
With the large language models achieving good performances on various downstream NLP tasks, it becomes possible to use large language models for zero-shot information extraction.
In this study, we aim to explore whether the most popular large language model, ChatGPT, can extract useful information from the radiological reports.
arXiv Detail & Related papers (2023-09-04T07:00:26Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z) - Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models [5.748877272090607]
Large language models (LLMs) are transforming the way humans interact with text.
We demonstrate a simple and efficient method for extracting materials data from full-text research papers.
This approach requires minimal to no coding or prior knowledge about the extracted property.
It offers high recall and nearly perfect precision in the resulting database.
arXiv Detail & Related papers (2023-02-09T19:56:37Z) - Structured information extraction from complex scientific text with
fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.