LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
- URL: http://arxiv.org/abs/2508.20875v2
- Date: Fri, 17 Oct 2025 15:10:50 GMT
- Title: LeMat-Traj: A Scalable and Unified Dataset of Materials Trajectories for Atomistic Modeling
- Authors: Ali Ramlaoui, Martin Siron, Inel Djafar, Joseph Musielewicz, Amandine Rossello, Victor Schmidt, Alexandre Duval,
- Abstract summary: We introduce LeMat-Traj, a curated dataset comprising over 120 million atomic configurations aggregated from large-scale repositories. LeMat-Traj standardizes data representation, harmonizes results and filters for high-quality configurations across widely used DFT functionals. We also present LeMaterial-Fetcher, a modular and open-source library designed to provide a reproducible framework for the community to easily incorporate new data sources.
- Score: 34.31458248589154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of accurate machine learning interatomic potentials (MLIPs) is limited by the fragmented availability and inconsistent formatting of quantum mechanical trajectory datasets derived from Density Functional Theory (DFT). These datasets are expensive to generate yet difficult to combine due to variations in format, metadata, and accessibility. To address this, we introduce LeMat-Traj, a curated dataset comprising over 120 million atomic configurations aggregated from large-scale repositories, including the Materials Project, Alexandria, and OQMD. LeMat-Traj standardizes data representation, harmonizes results and filters for high-quality configurations across widely used DFT functionals (PBE, PBESol, SCAN, r2SCAN). It significantly lowers the barrier for training transferrable and accurate MLIPs. LeMat-Traj spans both relaxed low-energy states and high-energy, high-force structures, complementing molecular dynamics and active learning datasets. By fine-tuning models pre-trained on high-force data with LeMat-Traj, we achieve a significant reduction in force prediction errors on relaxation tasks. We also present LeMaterial-Fetcher, a modular and extensible open-source library developed for this work, designed to provide a reproducible framework for the community to easily incorporate new data sources and ensure the continued evolution of large-scale materials datasets. LeMat-Traj and LeMaterial-Fetcher are publicly available at https://huggingface.co/datasets/LeMaterial/LeMat-Traj and https://github.com/LeMaterial/lematerial-fetcher.
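The abstract's filtering of "high-quality configurations" can be illustrated with a minimal sketch. The record keys (`energy`, `forces`) and the 50 eV/Å force cap below are hypothetical assumptions for illustration, not LeMat-Traj's actual schema or thresholds:

```python
import numpy as np

def max_force_norm(forces):
    """Largest per-atom force magnitude (eV/A) in one configuration."""
    f = np.asarray(forces, dtype=float)
    return float(np.linalg.norm(f, axis=1).max())

def is_high_quality(record, force_cap=50.0):
    """Crude quality filter: reject configurations whose forces are
    unphysically large. Field names and threshold are illustrative."""
    return max_force_norm(record["forces"]) <= force_cap

# A tame two-atom configuration (hypothetical record layout).
record = {
    "energy": -10.2,                  # eV
    "forces": [[0.1, 0.0, 0.0],
               [0.0, 0.2, 0.0]],      # eV/A per atom
}
print(is_high_quality(record))  # True
```

In practice such a filter would be applied while streaming records from the published Hugging Face dataset rather than to hand-built dictionaries.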
Related papers
- MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interatomic Potentials [11.867736304906508]
MatRIS is an invariant MLIP that introduces attention-based modeling of three-body interactions. MatRIS delivers accuracy comparable to that of leading equivariant models on a wide range of popular benchmarks.
arXiv Detail & Related papers (2026-03-02T15:52:41Z)
- TokaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics [56.073642366268764]
TokaMind is an open-source foundation model framework for fusion plasma modeling. It is trained on heterogeneous tokamak diagnostics from the publicly available MAST dataset. We evaluate TokaMind on the recently introduced MAST benchmark TokaMark.
arXiv Detail & Related papers (2026-02-16T12:26:07Z)
- Team, Then Trim: An Assembly-Line LLM Framework for High-Quality Tabular Data Generation [4.818677616222802]
This paper introduces Team-then-Trim (T²), a framework that synthesizes high-quality data through a collaborative team of LLMs. In T², specialized LLMs, guided by domain knowledge, are tasked with generating different data components sequentially. Empirical results on both simulated and real-world datasets demonstrate that T² outperforms state-of-the-art methods in producing high-quality data.
arXiv Detail & Related papers (2026-02-04T17:34:41Z)
- A Materials Map Integrating Experimental and Computational Data via Graph-Based Machine Learning for Enhanced Materials Discovery [5.06756291053173]
Materials informatics (MI) is expected to significantly accelerate material development and discovery. Data used in MI are derived from both computational and experimental studies. In this study, we use the obtained datasets to construct materials maps, which visualize the relationships between material properties and structural features.
arXiv Detail & Related papers (2025-03-10T14:31:34Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- Towards a Classification of Open-Source ML Models and Datasets for Software Engineering [52.257764273141184]
Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks.
However, these resources lack a classification tailored to Software Engineering (SE) needs.
We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
arXiv Detail & Related papers (2024-11-14T18:52:05Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling [7.142619575624596]
MatSciML is a benchmark for modeling MATerials SCIence using Machine Learning (MatSciML) methods.
MatSci ML provides a diverse set of materials systems and properties data for model training and evaluation.
In the multi-dataset learning setting, MatSci ML enables researchers to combine observations from multiple datasets to perform joint prediction of common properties.
arXiv Detail & Related papers (2023-09-12T03:08:37Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we use GPT-4 to generate high-quality data for each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [9.33544942080883]
This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science.
We accomplished this task by tuning GPT-3 on an existing perovskite solar cell FAIR dataset, achieving a 91.8% F1-score, and extended the dataset with data published since its release.
We also designed experiments to predict the electrical performance of solar cells and to design materials or devices with targeted parameters using large language models (LLMs).
arXiv Detail & Related papers (2023-04-05T04:01:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.