NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
- URL: http://arxiv.org/abs/2510.07172v1
- Date: Wed, 08 Oct 2025 16:12:11 GMT
- Title: NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents
- Authors: Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See
- Abstract summary: Large language models are emerging as powerful tools for scientific law discovery. Existing benchmarks for this task suffer from a fundamental methodological trilemma. We introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains.
- Score: 65.85967483058705
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiment reveals a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
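The abstract's two core ideas, a "metaphysical shift" (a systematic alteration of a canonical law) and interactive discovery through experimental probing, can be illustrated with a minimal sketch. Everything below is our own illustration, not the benchmark's actual API: the oracle, the altered distance exponent, and the brute-force search over candidate exponents are all hypothetical.

```python
# Hypothetical oracle implementing a "metaphysically shifted" gravitational
# law: the canonical inverse-square exponent 2 is altered to 3. The agent
# never sees this constant; it only observes input/output pairs.
HIDDEN_EXPONENT = 3
G = 6.674e-11  # gravitational constant (units irrelevant for the sketch)

def oracle(m1, m2, r):
    """Simulated system the agent probes experimentally."""
    return G * m1 * m2 / r ** HIDDEN_EXPONENT

def discover_exponent(candidates=(1, 2, 3, 4)):
    """Probe the oracle at chosen inputs, then pick the exponent whose
    predictions best match the observations."""
    probes = [(1.0, 1.0, r) for r in (1.0, 2.0, 4.0)]
    observations = [oracle(m1, m2, r) for m1, m2, r in probes]
    best, best_err = None, float("inf")
    for p in candidates:
        err = sum(abs(obs - G * m1 * m2 / r ** p)
                  for obs, (m1, m2, r) in zip(observations, probes))
        if err < best_err:
            best, best_err = p, err
    return best

print(discover_exponent())  # recovers the shifted exponent, 3
```

The point of the sketch is the interaction loop: the agent chooses where to query the system, observes responses, and only then fits a hypothesis, as opposed to static function fitting on a fixed dataset.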
Related papers
- Grounding LLMs in Scientific Discovery via Embodied Actions [84.11877211907647]
Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and physical simulation. We propose EmbodiedAct, a framework that transforms established scientific software into active embodied agents by grounding LLMs in embodied actions with a tight perception-execution loop.
arXiv Detail & Related papers (2026-02-24T07:37:18Z) - Accelerating Scientific Research with Gemini: Case Studies and Common Techniques [105.15622072347811]
Large language models (LLMs) have opened new avenues for accelerating scientific research. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models.
arXiv Detail & Related papers (2026-02-03T18:56:17Z) - Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration [63.61423859450929]
This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses. We identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery.
arXiv Detail & Related papers (2026-01-20T18:46:42Z) - Deep Learning in Astrophysics [0.2700171473617699]
Deep learning has generated diverse perspectives in astronomy, with ongoing discussions between proponents and skeptics motivating this review. We examine how neural networks complement classical statistics, extending our data analytical toolkit for modern surveys. This review demonstrates how deep learning incorporates domain knowledge through architectural design, with built-in assumptions guiding models toward physically meaningful solutions.
arXiv Detail & Related papers (2025-10-12T17:31:46Z) - Autonomous Agents for Scientific Discovery: Orchestrating Scientists, Language, Code, and Physics [82.55776608452017]
Large language models (LLMs) provide a flexible and versatile framework that orchestrates interactions with human scientists, natural language, computer language and code, and physics. This paper presents our view and vision of LLM-based scientific agents and their growing role in transforming the scientific discovery lifecycle. We identify open research challenges and outline promising directions for building more robust, generalizable, and adaptive scientific agents.
arXiv Detail & Related papers (2025-10-10T22:26:26Z) - The Need for Verification in AI-Driven Scientific Discovery [9.887965168376311]
Machine learning and large language models can generate hypotheses at a scale and speed far exceeding traditional methods. We argue that without scalable and reliable mechanisms for verification, scientific progress risks being hindered rather than advanced.
arXiv Detail & Related papers (2025-09-01T11:50:04Z) - Can Language Models Discover Scaling Laws? [57.794209392781845]
This paper introduces SLDAgent, an evolution-based agent that co-optimizes the scaling-law model and its parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrate that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts.
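The baseline task this summary refers to, fitting a parametric scaling law to observed data, can be sketched in a few lines. This is a generic illustration with synthetic data, not SLDAgent's evolutionary method: we fit a simple power law L(N) = a * N^(-b) by linear regression in log-log space.

```python
import math

def fit_power_law(ns, losses):
    """Fit L(N) = a * N**(-b) by ordinary least squares in log-log space,
    where the power law becomes the line log L = log a - b * log N."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic, noiseless observations generated from L(N) = 10 * N**-0.5
ns = [1e6, 1e7, 1e8]
losses = [10 * n ** -0.5 for n in ns]
a, b = fit_power_law(ns, losses)
print(round(a, 3), round(b, 3))  # recovers a = 10.0, b = 0.5
```

An agent like the one described would search over the functional form itself as well as its parameters; the sketch fixes the form to keep the fitting step visible.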
arXiv Detail & Related papers (2025-07-27T05:45:26Z) - Position: Intelligent Science Laboratory Requires the Integration of Cognitive and Embodied AI [98.19195693735487]
We propose the paradigm of Intelligent Science Laboratories (ISLs): a multi-layered, closed-loop framework that deeply integrates cognitive and embodied intelligence. We argue that such systems are essential for overcoming the current limitations of scientific discovery.
arXiv Detail & Related papers (2025-06-24T13:31:44Z) - ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows [82.07367406991678]
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
arXiv Detail & Related papers (2025-05-26T12:27:27Z) - PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration [9.216546947535244]
We introduce PiFlow, an information-theoretical framework for automated scientific discovery. Our method significantly improves discovery efficiency, reflected by a 73.55% increase in the Area Under the Curve. Overall, PiFlow serves as a plug-and-play method, establishing a novel paradigm for highly efficient automated scientific discovery.
arXiv Detail & Related papers (2025-05-21T03:09:39Z) - Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs [23.608962459019278]
We introduce a novel benchmark to evaluate Large Language Models (LLMs) for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases.
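The causal-graph-discovery principle this summary mentions can be made concrete with a toy two-variable example. The setup below is our own assumption, not Auto-Bench's actual protocol: a hidden mechanism where X causes Y, and an agent that distinguishes X -> Y from Y -> X by intervening on X and checking whether Y shifts.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def sample(intervene_x=None):
    """Hidden ground-truth mechanism: X causes Y (Y = X + noise).
    Passing intervene_x simulates the intervention do(X = value)."""
    x = intervene_x if intervene_x is not None else random.gauss(0, 1)
    y = x + random.gauss(0, 0.1)
    return x, y

def infer_direction(n=200):
    """Under do(X = 5), the distribution of Y shifts if and only if
    X is actually the cause; compare against observational samples."""
    ys_interv = [sample(intervene_x=5.0)[1] for _ in range(n)]
    ys_obs = [sample()[1] for _ in range(n)]
    mean_shift = sum(ys_interv) / n - sum(ys_obs) / n
    return "X->Y" if abs(mean_shift) > 1.0 else "Y->X"

print(infer_direction())  # "X->Y": intervening on X clearly shifts Y
```

Benchmarks of this kind scale the same idea to graphs with many variables, where the agent must also choose which interventions are most informative.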
arXiv Detail & Related papers (2025-02-21T05:35:20Z) - LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery [141.39722070734737]
We propose to enhance the knowledge-driven, abstract reasoning abilities of Large Language Models with the computational strength of simulations.
We introduce Scientific Generative Agent (SGA), a bilevel optimization framework.
We conduct experiments to demonstrate our framework's efficacy in law discovery and molecular design.
arXiv Detail & Related papers (2024-05-16T03:04:10Z)