Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis
- URL: http://arxiv.org/abs/2509.25281v1
- Date: Mon, 29 Sep 2025 07:03:09 GMT
- Title: Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis
- Authors: Yingming Pu, Tao Lin, Hongyu Chen,
- Abstract summary: The capacity of Large Language Models to generate valid scientific hypotheses for materials synthesis remains largely unquantified.<n>We introduce MatterMech, a benchmark for evaluating LLM-generated hypotheses across eight nanomaterial synthesis domains.<n>Our analysis reveals a critical disconnect: LLMs are proficient in abstract logic yet fail to ground their reasoning in fundamental physicochemical principles.
- Score: 9.216546947535244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The capacity of Large Language Models (LLMs) to generate valid scientific hypotheses for materials synthesis remains largely unquantified, hindered by the absence of benchmarks probing physicochemical logics reasoning. To address this, we introduce MatterMech, a benchmark for evaluating LLM-generated hypotheses across eight nanomaterial synthesis domains. Our analysis reveals a critical disconnect: LLMs are proficient in abstract logic yet fail to ground their reasoning in fundamental physicochemical principles. We demonstrate that our proposed principle-aware prompting methodology substantially outperforms standard Chain-of-Thought, enhancing both hypothesis accuracy and computational efficiency. This work provides a methodological framework to advance LLMs toward reliable scientific hypothesis generation in materials science. The MatterMech benchmark and associated code is publicly available at \href{https://github.com/amair-lab/MatterMech}{GitHub}.
Related papers
- Think like a Scientist: Physics-guided LLM Agent for Equation Discovery [22.586956876641406]
Large language models (LLMs) have emerged as promising tools for symbolic equation discovery.<n>We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.<n>KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
arXiv Detail & Related papers (2026-02-12T18:49:27Z) - LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning [107.60117957760794]
We introduce LatentChem, a latent reasoning interface that decouples chemical derivation from textual generation.<n>We show that LatentChem achieves a 59.88% non-tie win rate over strong CoT-based baselines on ChemCoTBench.<n>Our results provide empirical evidence that chemical reasoning is more naturally and effectively realized as continuous latent dynamics.
arXiv Detail & Related papers (2026-02-06T01:28:27Z) - How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear.<n>We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures.<n>Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z) - SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity.<n>Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints.<n>By measuring both solution correctness and multi-constraint adherence, SciIF enables finegrained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z) - oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning [44.36582860924775]
We introduce oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry.<n>We also propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity.
arXiv Detail & Related papers (2025-10-09T03:13:31Z) - PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning [57.868248683256574]
PRISM-Physics is a process-level evaluation framework and benchmark for complex physics reasoning problems.<n> Solutions are represented as directed acyclic graphs (DAGs) of formulas.<n>Results show that our evaluation framework is aligned with human experts' scoring.
arXiv Detail & Related papers (2025-10-03T17:09:03Z) - PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors [24.52206735857088]
We introduce PhysGym, a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning.<n> PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent.<n>The benchmark comprises a suite of interactive simulations, where agents must actively probe environments.
arXiv Detail & Related papers (2025-07-21T12:28:10Z) - Bridging the Plausibility-Validity Gap by Fine-Tuning a Reasoning-Enhanced LLM for Chemical Synthesis and Discovery [0.0]
Large Language Models often generate scientifically plausible but factually invalid information.<n>This paper presents a systematic methodology to bridge this gap by developing a specialized scientific assistant.
arXiv Detail & Related papers (2025-07-09T23:05:23Z) - Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature.<n>This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolbox.<n>We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z) - MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search [93.64235254640967]
Large language models (LLMs) have shown promise in automating scientific hypothesis generation.<n>We define the novel task of fine-grained scientific hypothesis discovery.<n>We propose a hierarchical search method that incrementally proposes and integrates details into the hypothesis.
arXiv Detail & Related papers (2025-05-25T16:13:46Z) - LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models [20.800445482814958]
Large Language Models (LLMs) have gained interest for their potential to leverage embedded scientific knowledge for hypothesis generation.<n>Existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery.<n>In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains.<n>Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR- Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning
arXiv Detail & Related papers (2025-04-14T17:00:13Z) - JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models (LLMs)<n>JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.<n>Our experimental results reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par or better than the human average but significantly worse than the human ceiling.
arXiv Detail & Related papers (2025-01-24T15:49:10Z) - MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses [72.39144388083712]
It remains unclear whether large language models (LLMs) can autonomously generate novel and valid hypotheses in chemistry.<n>We develop a benchmark of 51 high-impact chemistry papers published and online after January 2024, each manually annotated by PhD chemists with background, inspirations, and hypothesis.<n>We assume that LLMs may already encode latent scientific knowledge associations not yet recognized by humans.
arXiv Detail & Related papers (2024-10-09T17:19:58Z) - Leveraging large language models for nano synthesis mechanism explanation: solid foundations or mere conjectures? [12.874860522120326]
We develop a benchmark consisting of 775 multiple-choice questions focusing on the mechanisms of gold nanoparticles synthesis.
We propose a novel evaluation metric, the confidence-based score (c-score), which probes the output logits to derive the precise probability for the correct answer.
arXiv Detail & Related papers (2024-07-12T02:05:59Z) - FANToM: A Benchmark for Stress-testing Machine Theory of Mind in
Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.