MaScQA: A Question Answering Dataset for Investigating Materials Science
Knowledge of Large Language Models
- URL: http://arxiv.org/abs/2308.09115v1
- Date: Thu, 17 Aug 2023 17:51:05 GMT
- Authors: Mohd Zaki, Jayadeva, Mausam, N. M. Anoop Krishnan
- Abstract summary: This work curates a dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials student.
GPT-4 gives the best performance (~62% accuracy) compared to GPT-3.5.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Information extraction and textual comprehension from materials literature
are vital for developing an exhaustive knowledge base that enables accelerated
materials discovery. Language models have demonstrated their capability to
answer domain-specific questions and retrieve information from knowledge bases.
However, there are no benchmark datasets in the materials domain that can
evaluate the understanding of the key concepts by these language models. In
this work, we curate a dataset of 650 challenging questions from the materials
domain that require the knowledge and skills of a materials student who has
cleared their undergraduate degree. We classify these questions based on their
structure and the materials science domain-based subcategories. Further, we
evaluate the performance of GPT-3.5 and GPT-4 models on solving these questions
via zero-shot and chain-of-thought prompting. It is observed that GPT-4 gives
the best performance (~62% accuracy) compared to GPT-3.5. Interestingly, in
contrast to the general observation, no significant improvement in accuracy is
observed with chain-of-thought prompting. To evaluate the limitations, we
performed an error analysis, which revealed that conceptual errors (~64%)
contribute more than computational errors (~36%) to the reduced performance of
the LLMs. We hope that the dataset and analysis performed in this
work will promote further research in developing better materials science
domain-specific LLMs and strategies for information extraction.
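The zero-shot versus chain-of-thought evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual harness: the question, options, and answer keys below are hypothetical placeholders, and the prompt wording is only one common way to phrase the two settings.

```python
# Sketch of the evaluation setup: the same multiple-choice question is posed
# under zero-shot and chain-of-thought (CoT) prompting, and accuracy is the
# fraction of questions whose predicted option matches the answer key.

def build_prompt(question: str, options: dict[str, str], cot: bool = False) -> str:
    """Format a multiple-choice question as a prompt string."""
    lines = [question]
    lines += [f"({k}) {v}" for k, v in sorted(options.items())]
    if cot:
        # CoT prompting appends a cue asking the model to reason step by step.
        lines.append("Let's think step by step, then state the final option.")
    else:
        lines.append("Answer with the correct option only.")
    return "\n".join(lines)

def accuracy(predictions: list[str], keys: list[str]) -> float:
    """Fraction of predicted option letters that match the answer key."""
    return sum(p == k for p, k in zip(predictions, keys)) / len(keys)

# Illustrative usage with placeholder data (not a MaScQA item):
q = "Which crystal structure does aluminium adopt at room temperature?"
opts = {"A": "Body-centred cubic", "B": "Face-centred cubic",
        "C": "Hexagonal close-packed", "D": "Simple cubic"}
zero_shot_prompt = build_prompt(q, opts)
cot_prompt = build_prompt(q, opts, cot=True)
print(accuracy(["B", "A", "C"], ["B", "B", "C"]))  # 2 of 3 correct
```

In this framing, the paper's observation amounts to the accuracy under `cot_prompt` not being significantly higher than under `zero_shot_prompt`, contrary to the usual benefit reported for chain-of-thought prompting.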
Related papers
- Foundational Large Language Models for Materials Research [22.77591279242839]
Large Language Models (LLMs) offer opportunities to accelerate materials research through automated analysis and prediction.
Here, we present LLaMat, a family of foundational models for materials science developed through continued pretraining of LLaMA models.
We demonstrate that LLaMat excels in materials-specific NLP and structured information extraction while maintaining general linguistic capabilities.
arXiv Detail & Related papers (2024-12-12T18:46:38Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines [2.0330684186105805]
This study explores the efficacy of Large Language Models (LLMs) in identifying misleading versus non-misleading news headlines.
Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy.
arXiv Detail & Related papers (2024-05-06T04:06:45Z)
- Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data.
The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers.
To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z)
- Mining experimental data from Materials Science literature with Large Language Models: an evaluation study [1.9849264945671101]
This study is dedicated to assessing the capabilities of large language models (LLMs) in extracting structured information from scientific documents in materials science.
We focus on two critical tasks of information extraction: (i) a named entity recognition (NER) of studied materials and physical properties and (ii) a relation extraction (RE) between these entities.
The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baselines).
arXiv Detail & Related papers (2024-01-19T23:00:31Z)
- Knowledge Graph Question Answering for Materials Science (KGQA4MAT): Developing Natural Language Interface for Metal-Organic Frameworks Knowledge Graph (MOF-KG) Using LLM [35.208135795371795]
We present a benchmark dataset for Knowledge Graph Question Answering in Materials Science (KGQA4MAT).
A knowledge graph for metal-organic frameworks (MOF-KG) has been constructed by integrating structured databases and knowledge extracted from the literature.
We have developed a benchmark comprised of 161 complex questions involving comparison, aggregation, and complicated graph structures.
arXiv Detail & Related papers (2023-09-20T14:43:43Z)
- Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks [90.11273439036455]
Large Language Models (LLMs) have shown promising performance in knowledge-intensive reasoning tasks.
We propose Knowledge-Augmented Reasoning Distillation (KARD), a novel method that fine-tunes small LMs to generate rationales from LLMs with augmented knowledge retrieved from an external knowledge base.
We empirically show that KARD significantly improves the performance of small T5 and GPT models on the challenging knowledge-intensive reasoning datasets.
arXiv Detail & Related papers (2023-05-28T13:00:00Z)
- LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities [66.36633042421387]
We evaluate Large Language Models (LLMs) for Knowledge Graph (KG) construction and reasoning.
We propose AutoKG, a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning.
arXiv Detail & Related papers (2023-05-22T15:56:44Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.