LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
- URL: http://arxiv.org/abs/2504.10415v1
- Date: Mon, 14 Apr 2025 17:00:13 GMT
- Title: LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
- Authors: Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy
- Abstract summary: Large Language Models (LLMs) have gained interest for their potential to leverage embedded scientific knowledge for hypothesis generation. Existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning.
- Score: 20.800445482814958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
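Since the headline metric is symbolic accuracy, it helps to see how equivalence between a predicted and a ground-truth equation can be decided. Below is a minimal sketch using sympy; the function name and the simplification-based test are illustrative assumptions, not the benchmark's official scorer.

```python
# Minimal sketch (assumed, not LLM-SRBench's scorer): a prediction counts
# as symbolically accurate if it simplifies to the ground-truth form.
import sympy as sp

def symbolically_equivalent(pred_str: str, true_str: str) -> bool:
    """Return True if two equation strings simplify to the same expression."""
    pred = sp.sympify(pred_str)
    true = sp.sympify(true_str)
    # Equivalent expressions have a difference that simplifies to zero.
    return sp.simplify(pred - true) == 0

# The same law written in two algebraic forms still counts as a match,
# which is exactly the kind of rewriting the LSR-Transform category stresses.
print(symbolically_equivalent("v0*t - g*t**2/2", "t*(2*v0 - g*t)/2"))  # True
```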
Related papers
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.
Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.
Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs [23.608962459019278]
We introduce a novel benchmark to evaluate Large Language Models (LLMs) for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases.
arXiv Detail & Related papers (2025-02-21T05:35:20Z)
- InductionBench: LLMs Fail in the Simplest Complexity Class [53.70978746199222]
Large language models (LLMs) have shown remarkable improvements in reasoning. Inductive reasoning, where one infers the underlying rules from observed data, remains less explored. We introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs.
arXiv Detail & Related papers (2025-02-20T03:48:00Z)
- Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning [51.11965014462375]
Multimodal Large Language Models (MLLMs) integrate text, images, and other modalities.
This paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology.
arXiv Detail & Related papers (2025-02-05T04:05:27Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
We introduce the SciKnowEval benchmark, a framework that evaluates large language models (LLMs) across five progressive levels of scientific knowledge.
These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application.
We benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies.
arXiv Detail & Related papers (2024-06-13T13:27:52Z)
- LLM4ED: Large Language Models for Automatic Equation Discovery [0.8644909837301149]
We introduce a new framework that utilizes natural language-based prompts to guide large language models in automatically mining governing equations from data.
Specifically, we first use the generation capability of LLMs to produce diverse candidate equations in string form, and then evaluate the generated equations against observations (see the sketch below).
Extensive experiments are conducted on both partial differential equations and ordinary differential equations.
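A minimal sketch of the generate-then-evaluate loop described above, assuming candidates arrive as plain equation strings in one variable and fitness is mean squared error on the observations; the interface and scoring choice are illustrative, not LLM4ED's exact implementation:

```python
# Illustrative scoring of LLM-generated equation strings against data
# (assumed interface; not the LLM4ED implementation).
import numpy as np
import sympy as sp

def score_candidate(eq_str: str, x_obs: np.ndarray, y_obs: np.ndarray) -> float:
    """Lower is better: mean squared error of the candidate on the data."""
    try:
        expr = sp.sympify(eq_str)
        f = sp.lambdify(sp.symbols("x"), expr, modules="numpy")
        mse = float(np.mean((f(x_obs) - y_obs) ** 2))
        return mse if np.isfinite(mse) else float("inf")
    except Exception:
        return float("inf")  # unparsable or numerically invalid candidate

# Rank a batch of candidates proposed by the LLM.
x_obs = np.linspace(0.1, 2.0, 50)
y_obs = 3.0 * np.sin(x_obs) + 0.5 * x_obs
candidates = ["3*sin(x) + 0.5*x", "x**2", "exp(x) - 1"]
print(min(candidates, key=lambda c: score_candidate(c, x_obs, y_obs)))
```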
arXiv Detail & Related papers (2024-05-13T14:03:49Z)
- LLM-SR: Scientific Equation Discovery via Programming with Large Language Models [17.64574496035502]
Current methods of equation discovery, commonly known as symbolic regression, largely focus on extracting equations from data alone. We introduce LLM-SR, a novel approach that leverages the scientific knowledge and robust code generation capabilities of Large Language Models (see the sketch below). We show that LLM-SR discovers physically accurate equations that significantly outperform state-of-the-art symbolic regression baselines.
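LLM-SR's use of code generation suggests representing each hypothesis as a program skeleton whose constants are then fit to data. A sketch of that idea under assumed names, with scipy standing in for whatever optimizer the authors use:

```python
# Equation-as-program sketch (assumed setup, not LLM-SR's exact code):
# the LLM proposes the functional form; the constants are optimized.
import numpy as np
from scipy.optimize import minimize

def skeleton(x, params):
    """One hypothetical LLM-proposed form: a damped oscillation."""
    a, b, c = params
    return a * np.exp(-b * x) * np.cos(c * x)

def fit_skeleton(x_obs, y_obs, n_params=3):
    loss = lambda p: float(np.mean((skeleton(x_obs, p) - y_obs) ** 2))
    result = minimize(loss, x0=np.ones(n_params), method="Nelder-Mead")
    return result.x, result.fun  # fitted constants and final MSE
```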
arXiv Detail & Related papers (2024-04-29T03:30:06Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)