Large language models in materials science and the need for open-source approaches
- URL: http://arxiv.org/abs/2511.10673v1
- Date: Mon, 10 Nov 2025 00:05:20 GMT
- Title: Large language models in materials science and the need for open-source approaches
- Authors: Fengxu Yang, Weitong Chen, Jack D. Evans
- Abstract summary: This review examines recent applications of large language models (LLMs) across the materials discovery pipeline. We highlight how LLMs extract valuable information, such as synthesis conditions, from text. Benchmark results demonstrate that open-source alternatives can match performance while offering greater transparency, cost-effectiveness, and data privacy. As open-source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community-driven AI platforms for scientific discovery.
- Score: 3.35950184561189
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) are rapidly transforming materials science. This review examines recent LLM applications across the materials discovery pipeline, focusing on three key areas: mining scientific literature, predictive modelling, and multi-agent experimental systems. We highlight how LLMs extract valuable information such as synthesis conditions from text, learn structure-property relationships, and can coordinate agentic systems integrating computational tools and laboratory automation. While progress has been largely dependent on closed-source commercial models, our benchmark results demonstrate that open-source alternatives can match performance while offering greater transparency, reproducibility, cost-effectiveness, and data privacy. As open-source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community-driven AI platforms for scientific discovery.
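As a concrete illustration of the literature-mining use case, the sketch below prompts a locally served open-source LLM to return synthesis conditions as JSON. The endpoint, model name, and output keys are assumptions made for this example, not a setup described in the review.

```python
# Minimal sketch: prompting a locally served open-source LLM to extract
# synthesis conditions as JSON. Assumes an OpenAI-compatible endpoint
# (e.g. vLLM or Ollama) serving a model named "llama3"; adjust to your setup.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def extract_conditions(paragraph: str) -> dict:
    prompt = (
        "Extract the synthesis conditions from the paragraph below. "
        "Reply with JSON only, with keys temperature_C, time_h, solvent, "
        "and precursors.\n\nParagraph:\n" + paragraph
    )
    response = client.chat.completions.create(
        model="llama3",  # hypothetical local model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic decoding suits extraction tasks
    )
    return json.loads(response.choices[0].message.content)

print(extract_conditions(
    "ZIF-8 was synthesised by stirring zinc nitrate and 2-methylimidazole "
    "in methanol at 25 C for 24 h."
))
```

Serving the model locally is what yields the data-privacy and cost benefits the abstract highlights.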
Related papers
- O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL [28.10102994309489]
We introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning. We develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method.
arXiv Detail & Related papers (2026-01-07T09:31:10Z)
- Exploring LLMs for Scientific Information Extraction Using The SciEx Framework [12.534492015126757]
Large language models (LLMs) are touted as powerful tools for automating scientific information extraction. We present SciEx, a modular and composable framework that decouples key components, including PDF parsing, multi-modal retrieval, extraction, and aggregation. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently.
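The summary describes a decoupled pipeline; a minimal sketch of that shape is below, assuming hypothetical stage interfaces rather than SciEx's actual API.

```python
# Illustrative skeleton of a decoupled extraction pipeline: parsing,
# retrieval, extraction, and aggregation as swappable stages. The
# interfaces and names here are hypothetical, not SciEx's real API.
from dataclasses import dataclass
from typing import Protocol

class Parser(Protocol):
    def parse(self, pdf_path: str) -> list[str]: ...            # PDF -> text chunks

class Retriever(Protocol):
    def retrieve(self, chunks: list[str], query: str) -> list[str]: ...

class Extractor(Protocol):
    def extract(self, passages: list[str], query: str) -> list[dict]: ...

class Aggregator(Protocol):
    def aggregate(self, records: list[dict]) -> dict: ...

@dataclass
class Pipeline:
    parser: Parser
    retriever: Retriever
    extractor: Extractor
    aggregator: Aggregator

    def run(self, pdf_path: str, query: str) -> dict:
        chunks = self.parser.parse(pdf_path)
        passages = self.retriever.retrieve(chunks, query)
        records = self.extractor.extract(passages, query)
        return self.aggregator.aggregate(records)
```

Because each stage sits behind an interface, any one of them can be swapped without touching the rest, which is the composability the summary emphasises.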
arXiv Detail & Related papers (2025-12-10T19:00:20Z)
- LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature [60.879220305044726]
We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data. We curated 81k open-access papers, yielding LeMat-Synth (v1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes. We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
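To make "structured synthesis procedures" concrete, here is a hypothetical record schema for this kind of data; the field names are illustrative and are not LeMat-Synth's actual schema.

```python
# A hypothetical record schema for structured synthesis data of the kind
# LeMat-Synth curates; field names are illustrative, not the dataset's own.
from dataclasses import dataclass, field

@dataclass
class SynthesisRecord:
    material: str                    # e.g. "ZIF-8"
    material_class: str              # one of the 16 material classes
    method: str                      # one of the 35 synthesis methods
    precursors: list[str] = field(default_factory=list)
    temperature_C: float | None = None
    time_h: float | None = None
    source_doi: str | None = None    # provenance back to the paper

record = SynthesisRecord(
    material="ZIF-8", material_class="MOF", method="solvothermal",
    precursors=["zinc nitrate", "2-methylimidazole"],
    temperature_C=25.0, time_h=24.0,
)
```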
arXiv Detail & Related papers (2025-10-28T17:58:18Z)
- Speed Always Wins: A Survey on Efficient Architectures for Large Language Models [51.817121227562964]
Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment.
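A quick back-of-the-envelope calculation shows why attention cost dominates at long context: the score matrix grows quadratically with sequence length. The numbers below are illustrative, not taken from the survey.

```python
# Why standard self-attention becomes costly at scale: its score matrix
# grows quadratically with sequence length. Illustrative numbers only.
def attention_flops(seq_len: int, d_model: int) -> int:
    # QK^T and the attention-weighted V product each cost roughly
    # 2 * n^2 * d floating-point operations for an (n, d) input.
    return 2 * 2 * seq_len**2 * d_model

for n in (1_024, 8_192, 65_536):
    print(f"n={n:>6}: ~{attention_flops(n, d_model=4096) / 1e12:.1f} TFLOPs per layer")
```

Scaling the context from 1,024 to 65,536 tokens multiplies the per-layer attention cost by roughly 4,096; this quadratic growth is the obstacle the surveyed efficient architectures target.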
arXiv Detail & Related papers (2025-08-13T14:13:46Z)
- Rethinking Scale: The Efficacy of Fine-Tuned Open-Source LLMs in Large-Scale Reproducible Social Science Research [0.0]
Large Language Models (LLMs) are distinguished by their architecture, which dictates their parameter size and performance capabilities.
Social scientists have increasingly adopted LLMs for text classification tasks, which are difficult to scale with human coders.
This study demonstrates that small, fine-tuned open-source LLMs can match or exceed the performance of models such as ChatGPT-4.
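A minimal sketch of that recipe, fine-tuning a small open checkpoint for binary text classification with Hugging Face transformers, is below; the checkpoint and the toy two-example dataset are placeholders, not the study's setup.

```python
# Sketch: fine-tuning a small open-source model for text classification.
# The checkpoint and tiny dataset are placeholders for your own.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"  # any small open checkpoint works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

data = Dataset.from_dict({
    "text": ["protest coverage ...", "sports coverage ..."],  # toy examples
    "label": [1, 0],
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=data,
    tokenizer=tokenizer,  # lets Trainer pad batches dynamically
)
trainer.train()
```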
arXiv Detail & Related papers (2024-10-31T20:26:30Z)
- A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their ability to perform various language tasks efficiently with minimal computational resources.
We present a comprehensive survey of SLMs, focusing on their architectures, training techniques, and model compression methods.
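As one concrete compression technique from that toolbox, the sketch below applies post-training dynamic int8 quantization to linear layers with PyTorch; the tiny model is a stand-in, and real SLM pipelines typically combine quantization with pruning and distillation.

```python
# Post-training dynamic quantization of linear layers to int8 with PyTorch.
# The model here is a stand-in for a small transformer block.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768),
)

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # quantize only Linear weights
)

x = torch.randn(1, 768)
print(quantized(x).shape)  # same interface, smaller weights
```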
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently attracted substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
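The generic retrieval-augmentation pattern these plugins share is easy to sketch: embed the query, fetch the most similar passages, and prepend them to the prompt. Everything below (the toy corpus and the placeholder hash-based embedder) is illustrative and is not RA-BLIP's architecture.

```python
# Generic retrieval-augmentation sketch: embed a query, retrieve the most
# similar passages, and prepend them to the prompt for the generator.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; swap in a real text or image encoder.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

corpus = ["MOFs are porous crystals.", "Perovskites absorb light well."]
index = np.stack([embed(p) for p in corpus])

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ embed(query)          # cosine similarity (unit vectors)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

query = "Which materials are good light absorbers?"
prompt = "Context:\n" + "\n".join(retrieve(query)) + f"\n\nQuestion: {query}"
print(prompt)
```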
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language. Structured data is crucial for innovative and systematic materials design. The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z)
- Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing [5.527358421206627]
We present a simulation of various active learning strategies for the discovery of polymer solar cell donor/acceptor pairs.
Our approach demonstrates a potential reduction in discovery time by approximately 75%, equivalent to a 15-year acceleration in material innovation.
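The kind of loop such simulations run is simple to sketch: fit a surrogate on the labelled pairs, then query the candidate the model is most uncertain about. The synthetic data and Gaussian-process surrogate below are assumptions for illustration, not the paper's dataset or model.

```python
# Uncertainty-driven active learning loop: fit a surrogate on labelled
# candidates, then "run the experiment" on the most uncertain one.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X_pool = rng.uniform(0, 1, size=(200, 4))                           # candidate descriptors
y_pool = X_pool @ [3.0, -1.0, 2.0, 0.5] + rng.normal(0, 0.1, 200)   # hidden target property

labelled = list(rng.choice(200, size=5, replace=False))
for _ in range(20):                                   # budget of 20 "experiments"
    gp = GaussianProcessRegressor().fit(X_pool[labelled], y_pool[labelled])
    mean, std = gp.predict(X_pool, return_std=True)
    std[labelled] = -np.inf                           # never re-query known points
    labelled.append(int(np.argmax(std)))              # most uncertain candidate

print(f"best value found: {y_pool[labelled].max():.2f}")
```

Swapping the acquisition rule (e.g. expected improvement instead of maximum uncertainty) is the kind of strategy variation such simulations compare.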
arXiv Detail & Related papers (2024-02-29T18:54:46Z)
- Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
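A minimal plan-and-execute loop in the spirit of such agents is sketched below; the stub tools and the JSON action format are hypothetical, not Eunomia's implementation.

```python
# Minimal tool-using agent loop: the LLM picks a tool, the harness runs it,
# and the result is fed back until the agent answers. `llm` is a stand-in
# callable that maps a transcript string to the model's reply.
import json

TOOLS = {
    "search_papers": lambda q: f"3 abstracts mentioning '{q}'",   # stub tools
    "parse_table":   lambda ref: f"rows extracted from {ref}",
}

def run_agent(llm, task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\nTools: {list(TOOLS)}\n"
    for _ in range(max_steps):
        # The LLM is expected to reply with JSON such as
        # {"tool": "search_papers", "input": "..."} or {"answer": "..."}.
        step = json.loads(llm(transcript))
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](step["input"])
        transcript += f"\n{step['tool']}({step['input']!r}) -> {result}"
    return "gave up"
```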
arXiv Detail & Related papers (2023-12-18T20:29:58Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
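The core idea, sharing samples from a fitted generative model instead of the sensitive series itself, can be shown with a toy AR(1) generator; TSGM provides far richer generators (GANs, VAEs), and the code below is not its API.

```python
# Toy illustration of synthetic time series generation: fit an AR(1) model
# to a sensitive series, then share samples from the fitted model instead
# of the raw data.
import numpy as np

rng = np.random.default_rng(42)
real = np.cumsum(rng.normal(0, 1, 500)) * 0.1 + 20   # stand-in sensitive series

# Fit x_t = c + phi * x_{t-1} + eps via least squares.
x_prev, x_next = real[:-1], real[1:]
phi, c = np.polyfit(x_prev, x_next, 1)
sigma = np.std(x_next - (phi * x_prev + c))

def sample(n: int) -> np.ndarray:
    out = np.empty(n)
    out[0] = real[0]
    for t in range(1, n):
        out[t] = c + phi * out[t - 1] + rng.normal(0, sigma)
    return out

synthetic = sample(500)    # shareable series with similar dynamics
print(f"phi={phi:.3f}, sigma={sigma:.3f}")
```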
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.