Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning
- URL: http://arxiv.org/abs/2506.08235v1
- Date: Mon, 09 Jun 2025 21:04:39 GMT
- Title: Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim $\rightarrow$ Evidence Reasoning
- Authors: Shashidhar Reddy Javaji, Yupeng Cao, Haohang Li, Yangyang Yu, Nikhil Muralidhar, Zining Zhu
- Abstract summary: We present CLAIM-BENCH, a benchmark for evaluating large language models' capabilities in scientific claim-evidence extraction and validation. We show that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall. Strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims.
- Score: 6.043212666944194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly being used for complex research tasks such as literature review, idea generation, and scientific paper analysis, yet their ability to truly understand and process the intricate relationships within complex research papers, such as the logical links between claims and supporting evidence, remains largely unexplored. In this study, we present CLAIM-BENCH, a comprehensive benchmark for evaluating LLMs' capabilities in scientific claim-evidence extraction and validation, a task that reflects deeper comprehension of scientific argumentation. We systematically compare three divide-and-conquer-inspired prompting approaches across six diverse LLMs, highlighting model-specific strengths and weaknesses in scientific comprehension. Through evaluations involving over 300 claim-evidence pairs across multiple research domains, we reveal significant limitations in LLMs' ability to process complex scientific content. Our results demonstrate that closed-source models like GPT-4 and Claude consistently outperform open-source counterparts in precision and recall across claim-evidence identification tasks. Furthermore, strategically designed three-pass and one-by-one prompting approaches significantly improve LLMs' abilities to accurately link dispersed evidence with claims, although this comes at increased computational cost. CLAIM-BENCH sets a new standard for evaluating scientific comprehension in LLMs, offering both a diagnostic tool and a path forward for building systems capable of deeper, more reliable reasoning across full-length papers.
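To make the one-by-one prompting strategy and the precision/recall scoring concrete, the sketch below shows a minimal claim-evidence linking loop. It is an illustrative assumption, not the authors' released implementation: the `ask_llm` function is a hypothetical stand-in for any chat-completion client, and the yes/no prompt wording and toy scoring are placeholders.

```python
from itertools import product


def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in any chat-completion client here."""
    raise NotImplementedError


def link_claims_one_by_one(claims, evidence_spans):
    """One-by-one prompting: a separate yes/no query per (claim, evidence) pair."""
    predicted = set()
    for ci, ei in product(range(len(claims)), range(len(evidence_spans))):
        prompt = (
            "Does the following evidence support the claim? Answer yes or no.\n"
            f"Claim: {claims[ci]}\n"
            f"Evidence: {evidence_spans[ei]}"
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            predicted.add((ci, ei))
    return predicted


def precision_recall(predicted, gold):
    """Score predicted claim-evidence links against gold-standard pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

The cost trade-off noted in the abstract is visible in this sketch: the pairwise loop issues one query per (claim, evidence) combination, so improved linking of dispersed evidence comes at a clear computational price compared with a single full-paper pass.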
Related papers
- Roadmap for using large language models (LLMs) to accelerate cross-disciplinary research with an example from computational biology [0.0]
Large language models (LLMs) are powerful artificial intelligence (AI) tools transforming how research is conducted. Their use in research has been met with skepticism, due to concerns about hallucinations, biases and potential harms to research. Here, we present a roadmap for integrating LLMs into cross-disciplinary research.
arXiv Detail & Related papers (2025-07-04T17:20:14Z) - LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research [32.35279830326718]
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in reproducing code from research papers, especially in the NLP domain, remains underexplored. We present LMR-BENCH, a benchmark designed to evaluate the capability of LLM agents on code reproduction from Language Modeling Research.
arXiv Detail & Related papers (2025-06-19T07:04:16Z) - Towards Artificial Intelligence Research Assistant for Expert-Involved Learning [64.7438151207189]
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research. We present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL).
arXiv Detail & Related papers (2025-05-03T14:21:48Z) - LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models [20.800445482814958]
Large Language Models (LLMs) have gained interest for their potential to leverage embedded scientific knowledge for hypothesis generation. Existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning.
arXiv Detail & Related papers (2025-04-14T17:00:13Z) - ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z) - R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z) - Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models [20.648157071328807]
Large language models (LLMs) can identify novel research directions by analyzing existing knowledge.
LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect.
We propose KG-CoI, a system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs.
arXiv Detail & Related papers (2024-11-04T18:50:00Z) - GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
arXiv Detail & Related papers (2024-10-11T03:05:06Z) - A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z) - SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis [26.111514038691837]
SciAssess is a benchmark for the comprehensive evaluation of Large Language Models (LLMs) in scientific literature analysis.
It aims to thoroughly assess the efficacy of LLMs by evaluating their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3).
It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, materials, and medicine.
arXiv Detail & Related papers (2024-03-04T12:19:28Z) - A Reliable Knowledge Processing Framework for Combustion Science using Foundation Models [0.0]
The study introduces an approach to process diverse combustion research data, spanning experimental studies, simulations, and literature.
The developed approach minimizes computational and economic expenses while optimizing data privacy and accuracy.
The framework consistently delivers accurate domain-specific responses with minimal human oversight.
arXiv Detail & Related papers (2023-12-31T17:15:25Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that current LLMs fall short of delivering satisfactory performance, with a best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)