Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science
- URL: http://arxiv.org/abs/2506.04410v1
- Date: Wed, 04 Jun 2025 19:43:18 GMT
- Title: Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science
- Authors: Peter Jansen, Samiah Hassan, Ruoyao Wang
- Abstract summary: We introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. We show that strong baselines that include retrieval-augmented generation over scientific literature and code generation fail to exceed 72% performance.
- Score: 1.7113423851651721
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contemporary approaches to assisted scientific discovery use language models to automatically generate large numbers of potential hypotheses to test, while also automatically generating code-based experiments to test those hypotheses. While hypotheses can be comparatively inexpensive to generate, automated experiments can be costly, particularly when run at scale (i.e., thousands of experiments). Developing the capacity to filter hypotheses based on their feasibility would allow discovery systems to run at scale while increasing their likelihood of making significant discoveries. In this work we introduce Matter-of-Fact, a challenge dataset for determining the feasibility of hypotheses framed as claims. Matter-of-Fact includes 8.4k claims extracted from scientific articles spanning four high-impact contemporary materials science topics (superconductors, semiconductors, batteries, and aerospace materials) and covers qualitative and quantitative claims drawn from theoretical, experimental, and code/simulation results. We show that strong baselines that include retrieval-augmented generation over scientific literature and code generation fail to exceed 72% performance on this task (chance performance is 50%), while domain-expert verification suggests nearly all claims are solvable -- highlighting both the difficulty of this task for current models, and the potential to accelerate scientific discovery by making near-term progress.
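At its core the benchmark is a binary decision: given a claim, predict whether it is feasible, so chance performance is 50%. The sketch below illustrates one way a retrieval-augmented baseline of the kind described in the abstract could be wired together. The data schema, the `retrieve` and `llm` callables, and the prompt wording are illustrative assumptions for this sketch, not the released Matter-of-Fact format or the authors' baseline implementation.

```python
# Minimal sketch of the claim-feasibility task and a toy retrieval-augmented
# baseline. Field names, callables, and prompt text are hypothetical; they are
# not the official Matter-of-Fact schema or evaluation code.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Claim:
    text: str    # e.g. "Compound X remains superconducting above 30 K"
    topic: str   # superconductors | semiconductors | batteries | aerospace
    label: bool  # True if the claim is judged feasible


def predict_feasibility(claim: Claim,
                        retrieve: Callable[[str, int], List[str]],
                        llm: Callable[[str], str]) -> bool:
    """Retrieve literature snippets, then ask a language model for a
    binary FEASIBLE / INFEASIBLE judgement on the claim."""
    evidence = retrieve(claim.text, 5)  # hypothetical literature retriever
    prompt = (
        "Given the following literature excerpts, answer FEASIBLE or "
        "INFEASIBLE for the claim.\n\n"
        + "\n".join(evidence)
        + f"\n\nClaim: {claim.text}\nAnswer:"
    )
    return llm(prompt).strip().upper().startswith("FEASIBLE")


def accuracy(claims: Iterable[Claim],
             retrieve: Callable[[str, int], List[str]],
             llm: Callable[[str], str]) -> float:
    """Fraction of claims labeled correctly; 0.5 is chance on a balanced split."""
    claims = list(claims)
    correct = sum(predict_feasibility(c, retrieve, llm) == c.label for c in claims)
    return correct / len(claims)
```

Under this framing, the reported 72% ceiling for strong baselines corresponds to an accuracy of 0.72 from a routine like `accuracy`, against a chance level of 0.5.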
Related papers
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
- SciClaimHunt: A Large Dataset for Evidence-based Scientific Claim Verification [7.421845364041002]
We introduce two large-scale datasets, SciClaimHunt and SciClaimHunt_Num, derived from scientific research papers. We propose several baseline models tailored for scientific claim verification to assess the effectiveness of these datasets. We evaluate models trained on SciClaimHunt and SciClaimHunt_Num against existing scientific claim verification datasets to gauge their quality and reliability.
arXiv Detail & Related papers (2025-02-14T08:34:26Z)
- MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses [72.39144388083712]
It remains unclear whether large language models (LLMs) can autonomously generate novel and valid hypotheses in chemistry. We develop a benchmark of 51 high-impact chemistry papers published and online after January 2024, each manually annotated by PhD chemists with background, inspirations, and hypothesis. We assume that LLMs may already encode latent scientific knowledge associations not yet recognized by humans.
arXiv Detail & Related papers (2024-10-09T17:19:58Z)
- LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)
- Large Language Models are Zero Shot Hypothesis Proposers [17.612235393984744]
Large Language Models (LLMs) hold a wealth of global and interdisciplinary knowledge that promises to break down information barriers.
We construct a dataset consisting of background knowledge and hypothesis pairs from biomedical literature.
We evaluate the hypothesis generation capabilities of various top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2023-11-10T10:03:49Z)
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery.
Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity.
A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z)
- Can ChatGPT be used to generate scientific hypotheses? [0.2010294990327175]
Generative AI seems to be able to effectively structure vast amounts of scientific knowledge and provide interesting and testable hypotheses.
The future scientific enterprise may include synergistic efforts with a swarm of "hypothesis machines", challenged by automated experimentation and adversarial peer reviews.
arXiv Detail & Related papers (2023-03-30T20:40:52Z)
- GFlowNets for AI-Driven Scientific Discovery [74.27219800878304]
We present a new probabilistic machine learning framework called GFlowNets.
GFlowNets can be applied in the modeling, hypotheses generation and experimental design stages of the experimental science loop.
We argue that GFlowNets can become a valuable tool for AI-driven scientific discovery.
arXiv Detail & Related papers (2023-02-01T17:29:43Z)
- SciFact-Open: Towards open-domain scientific claim verification [61.288725621156864]
We present SciFact-Open, a new test collection designed to evaluate the performance of scientific claim verification systems.
We collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models.
We find that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1.
arXiv Detail & Related papers (2022-10-25T05:45:00Z)
- Interpretable and Explainable Machine Learning for Materials Science and Chemistry [2.2175470459999636]
We summarize applications of interpretability and explainability techniques for materials science and chemistry.
We discuss various challenges for interpretable machine learning in materials science and, more broadly, in scientific settings.
We showcase a number of exciting developments in other fields that could benefit interpretability in materials science and chemistry problems.
arXiv Detail & Related papers (2021-11-01T15:40:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.