Less can be more for predicting properties with large language models
- URL: http://arxiv.org/abs/2406.17295v3
- Date: Wed, 09 Jul 2025 17:37:23 GMT
- Title: Less can be more for predicting properties with large language models
- Authors: Nawaf Alampara, Santiago Miret, Kevin Maik Jablonka
- Abstract summary: We report on limitations in large language models' ability to learn from coordinate information in coordinate-category data. We find that LLMs consistently fail to capture coordinate information while excelling at category patterns. Our findings suggest immediate practical implications for materials property prediction tasks dominated by structural effects.
- Score: 5.561723952524538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting properties from coordinate-category data -- sets of vectors paired with categorical information -- is fundamental to computational science. In materials science, this challenge manifests as predicting properties like formation energies or elastic moduli from crystal structures comprising atomic positions (vectors) and element types (categorical information). While large language models (LLMs) have increasingly been applied to such tasks, with researchers encoding structural data as text, optimal strategies for achieving reliable predictions remain elusive. Here, we report fundamental limitations in LLMs' ability to learn from coordinate information in coordinate-category data. Through systematic experiments using synthetic datasets with tunable coordinate and category contributions, combined with a comprehensive benchmarking framework (MatText) spanning multiple representations and model scales, we find that LLMs consistently fail to capture coordinate information while excelling at category patterns. This geometric blindness persists regardless of model size (up to 70B parameters), dataset scale (up to 2M structures), or text representation strategy. Our findings suggest immediate practical implications: for materials property prediction tasks dominated by structural effects, specialized geometric architectures consistently outperform LLMs by significant margins, as evidenced by a clear "GNN-LM wall" in performance benchmarks. Based on our analysis, we provide concrete guidelines for architecture selection in scientific machine learning, while highlighting the critical importance of understanding model inductive biases when tackling scientific prediction problems.
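To make the experimental setup concrete, the sketch below illustrates one way to build the kind of synthetic coordinate-category dataset the abstract describes: a target property that mixes a coordinate-dependent term and a category-dependent term with a tunable weight, plus a plain-text serialization of a structure for LLM prompting. The weighting scheme, per-element values, and text template here are illustrative assumptions, not the authors' exact protocol or an actual MatText representation.

```python
# Illustrative sketch (not the authors' code): a synthetic coordinate-category
# dataset whose target mixes a geometric term (from coordinates) and a
# categorical term (from element types), controlled by a tunable weight `alpha`.
import numpy as np

rng = np.random.default_rng(0)
ELEMENTS = ["Si", "O", "Fe", "Al", "Mg"]
CATEGORY_VALUE = {el: i * 0.5 for i, el in enumerate(ELEMENTS)}  # assumed lookup table

def make_structure(n_atoms=8):
    """Random fractional coordinates plus element labels."""
    coords = rng.random((n_atoms, 3))
    species = rng.choice(ELEMENTS, size=n_atoms)
    return coords, species

def target(coords, species, alpha=0.5):
    """Synthetic property: alpha * coordinate term + (1 - alpha) * category term."""
    # Geometric term: mean pairwise distance (depends only on coordinates).
    diffs = coords[:, None, :] - coords[None, :, :]
    geometric = np.linalg.norm(diffs, axis=-1).mean()
    # Categorical term: mean of per-element values (depends only on species).
    categorical = np.mean([CATEGORY_VALUE[s] for s in species])
    return alpha * geometric + (1 - alpha) * categorical

def to_text(coords, species):
    """Serialize a structure as plain text for an LLM prompt (generic template)."""
    lines = [f"{s} {x:.3f} {y:.3f} {z:.3f}" for s, (x, y, z) in zip(species, coords)]
    return "\n".join(lines)

if __name__ == "__main__":
    coords, species = make_structure()
    print(to_text(coords, species))
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, round(target(coords, species, alpha), 3))
```

Sweeping `alpha` from 0 (purely categorical) to 1 (purely geometric) isolates whether a model's errors stem from ignoring coordinate information, which is the failure mode the paper reports for LLMs.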
Related papers
- Pre-trained Large Language Models Learn Hidden Markov Models In-context [10.06882436449576]
Hidden Markov Models (HMMs) are tools for modeling sequential data with latent structure, yet fitting them to real-world data remains computationally challenging. We show that pre-trained large language models (LLMs) can effectively learn from data generated by HMMs via in-context learning.
arXiv Detail & Related papers (2025-06-08T21:49:38Z) - Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures [49.19753720526998]
We derive theoretical scaling laws for neural network performance on synthetic datasets. We validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.
arXiv Detail & Related papers (2025-05-11T17:44:14Z) - Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding.
Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark.
A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images.
We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z) - A Materials Map Integrating Experimental and Computational Data via Graph-Based Machine Learning for Enhanced Materials Discovery [5.06756291053173]
Materials informatics (MI) is expected to significantly accelerate material development and discovery. Data used in MI are derived from both computational and experimental studies. In this study, we use the obtained datasets to construct materials maps, which visualize the relationships between material properties and structural features.
arXiv Detail & Related papers (2025-03-10T14:31:34Z) - Leveraging Large Language Models to Address Data Scarcity in Machine Learning: Applications in Graphene Synthesis [0.0]
Machine learning in materials science faces challenges due to limited experimental data. We propose strategies that utilize large language models (LLMs) to enhance machine learning performance.
arXiv Detail & Related papers (2025-03-06T16:04:01Z) - Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z) - MatExpert: Decomposing Materials Discovery by Mimicking Human Experts [26.364419690908992]
MatExpert is a novel framework that leverages Large Language Models and contrastive learning to accelerate the discovery and design of new solid-state materials.
Inspired by the workflow of human materials design experts, our approach integrates three key stages: retrieval, transition, and generation.
MatExpert represents a meaningful advancement in computational materials discovery using language-based generative models.
arXiv Detail & Related papers (2024-10-26T00:44:54Z) - A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and their ability to perform various language tasks with minimal computational resources.
We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z) - From Tokens to Materials: Leveraging Language Models for Scientific Discovery [12.211984932142537]
This study investigates the application of language model embeddings to enhance material property prediction in materials science.
We demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties.
arXiv Detail & Related papers (2024-10-21T16:31:23Z) - How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension [53.6373473053431]
This work introduces a benchmark to assess large language models' capabilities in graph pattern tasks. We have developed a benchmark that evaluates whether LLMs can understand graph patterns based on either terminological or topological descriptions. Our benchmark encompasses both synthetic and real datasets, and a variety of models, with a total of 11 tasks and 7 models.
arXiv Detail & Related papers (2024-10-04T04:48:33Z) - Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z) - MaterioMiner -- An ontology-based text mining dataset for extraction of process-structure-property entities [0.0]
We present the MaterioMiner dataset and the materials ontology where ontological concepts are associated with textual entities.
We explore the consistency between the three raters and fine-tune pre-trained models to showcase the feasibility of named-entity recognition model training.
arXiv Detail & Related papers (2024-08-05T21:42:59Z) - LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Mining experimental data from Materials Science literature with Large Language Models: an evaluation study [1.9849264945671101]
This study is dedicated to assessing the capabilities of large language models (LLMs) in extracting structured information from scientific documents in materials science.
We focus on two critical tasks of information extraction: (i) a named entity recognition (NER) of studied materials and physical properties and (ii) a relation extraction (RE) between these entities.
The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline).
arXiv Detail & Related papers (2024-01-19T23:00:31Z) - How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z) - Generative retrieval-augmented ontologic graph and multi-agent
strategies for interpretive large language model-based materials design [0.0]
Transformer neural networks show promising capabilities, in particular for uses in materials analysis, design and manufacturing.
Here we explore the use of large language models (LLMs) as a tool that can support engineering analysis of materials.
arXiv Detail & Related papers (2023-10-30T20:31:50Z) - Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - Leveraging Language Representation for Material Recommendation, Ranking,
and Exploration [0.0]
We introduce a material discovery framework that uses natural language embeddings derived from language models as representations of compositional and structural features.
By applying the framework to thermoelectrics, we demonstrate diversified recommendations of prototype structures and identify under-studied high-performance material spaces.
arXiv Detail & Related papers (2023-05-01T21:58:29Z) - Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z) - MatSciBERT: A Materials Domain Language Model for Text Mining and
Information Extraction [13.924666106089425]
MatSciBERT is a language model trained on a large corpus of scientific literature published in the materials domain.
We show that MatSciBERT outperforms SciBERT on three downstream tasks, namely, abstract classification, named entity recognition, and relation extraction.
We also discuss some of the applications of MatSciBERT in the materials domain for extracting information.
arXiv Detail & Related papers (2021-09-30T17:35:02Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)