Lessons in Reproducibility: Insights from NLP Studies in Materials Science
- URL: http://arxiv.org/abs/2307.15759v1
- Date: Fri, 28 Jul 2023 18:36:42 GMT
- Authors: Xiangyun Lei, Edward Kim, Viktoriia Baibakova, Shijing Sun
- Abstract summary: We aim to comprehend these studies from a reproducibility perspective, acknowledging their significant influence on the field of materials informatics, rather than critiquing them.
Our study indicates that both papers offered thorough workflows, tidy and well-documented codebases, and clear guidance for model evaluation.
We highlight areas for improvement, such as providing access to training data where copyright restrictions permit, greater transparency on model architecture and the training process, and specification of software dependency versions.
- Score: 4.205692673448206
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Natural Language Processing (NLP), a cornerstone field within artificial
intelligence, has been increasingly utilized in the field of materials science
literature. Our study conducts a reproducibility analysis of two pioneering
works within this domain: "Machine-learned and codified synthesis parameters of
oxide materials" by Kim et al., and "Unsupervised word embeddings capture
latent knowledge from materials science literature" by Tshitoyan et al. We aim
to comprehend these studies from a reproducibility perspective, acknowledging
their significant influence on the field of materials informatics, rather than
critiquing them. Our study indicates that both papers offered thorough
workflows, tidy and well-documented codebases, and clear guidance for model
evaluation. This makes it easier to replicate their results successfully and
partially reproduce their findings. In doing so, they set commendable standards
for future materials science publications to aspire to. However, our analysis
also highlights areas for improvement, such as providing access to training
data where copyright restrictions permit, greater transparency on model
architecture and the training process, and specification of software
dependency versions. We also cross-compare the word embedding models between
papers, and find that some key differences in reproducibility and
cross-compatibility are attributable to design choices outside the bounds of
the models themselves. In summary, our study appreciates the benchmark set by
these seminal papers while advocating for further enhancements in research
reproducibility practices in the field of NLP for materials science. This
balance of understanding and continuous improvement will ultimately propel the
intersecting domains of NLP and materials science literature into a future of
exciting discoveries.
Related papers
- From Tokens to Materials: Leveraging Language Models for Scientific Discovery [12.211984932142537]
This study investigates the application of language model embeddings to enhance material property prediction in materials science.
We demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties.
arXiv Detail & Related papers (2024-10-21T16:31:23Z)
- From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language.
Structured data is crucial for innovative and systematic materials design.
The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z)
- Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval-enhancement can be extended to a broader spectrum of machine learning (ML).
This work introduces a formal framework for this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature across various domains of ML with consistent notation, which is missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
arXiv Detail & Related papers (2024-02-20T11:28:50Z)
- Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction [23.489721319567025]
We discuss, quantify, and document challenges in automated information extraction from materials science literature.
This information is spread in multiple formats, such as tables, text, and images, and with little or no uniformity in reporting style.
We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.
arXiv Detail & Related papers (2023-10-12T14:57:24Z)
- Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Semantic and Relational Spaces in Science of Science: Deep Learning Models for Article Vectorisation [4.178929174617172]
We focus on document-level embeddings based on the semantic and relational aspects of articles, using Natural Language Processing (NLP) and Graph Neural Networks (GNNs).
Our results show that using NLP we can encode a semantic space of articles, while with GNN we are able to build a relational space where the social practices of a research community are also encoded.
arXiv Detail & Related papers (2020-11-05T14:57:41Z)
- The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain [11.085048329202335]
We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications.
A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition.
We present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set.
arXiv Detail & Related papers (2020-06-04T17:49:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.