ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
- URL: http://arxiv.org/abs/2511.14366v2
- Date: Thu, 20 Nov 2025 06:27:38 GMT
- Title: ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
- Authors: Hongwei Liu, Junnan Liu, Shudong Liu, Haodong Duan, Yuqiang Li, Mao Su, Xiaohong Liu, Guangtao Zhai, Xinyu Fang, Qianhong Ma, Taolin Zhang, Zihan Ma, Yufeng Zhao, Peiheng Zhou, Linchen Xiao, Wenlong Zhang, Shijie Zhou, Xingjian Ma, Siqi Sun, Jiaye Ge, Meng Li, Yuhong Liu, Jianxin Dong, Jiaying Li, Hui Wu, Hanwen Liang, Jintai Lin, Yanting Wang, Jie Dong, Tong Zhu, Tianfan Fu, Conghui He, Qi Zhang, Songyang Zhang, Lei Bai, Kai Chen
- Abstract summary: ATLAS is a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Its key features include High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities.
- Score: 118.46980291324148
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, questioning their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
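The abstract's "panel of LLM judges" evaluation paradigm can be pictured with a small sketch. The snippet below is a hypothetical illustration, not the authors' released code: each judge is an arbitrary callable (in practice, a prompt sent to a different LLM), and the panel's verdict is a simple majority vote; the prompt template, judge models, and aggregation rule actually used by ATLAS may differ.

```python
# Hypothetical sketch of a panel-of-LLM-judges scorer (not the ATLAS authors' code).
# Each judge maps (question, reference, candidate) -> "correct"/"incorrect";
# the panel verdict is a strict majority vote over the judges' outputs.
from typing import Callable, List

Judge = Callable[[str, str, str], str]  # returns "correct" or "incorrect"

# Assumed prompt template; the real grading prompt is not given in the abstract.
JUDGE_PROMPT = (
    "You are grading a scientific answer.\n"
    "Question: {question}\n"
    "Reference answer (LaTeX allowed): {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly one word: correct or incorrect."
)

def panel_verdict(question: str, reference: str, candidate: str,
                  judges: List[Judge]) -> bool:
    """Return True if a strict majority of judges marks the candidate correct."""
    votes = [judge(question, reference, candidate) for judge in judges]
    return sum(v.strip().lower() == "correct" for v in votes) > len(votes) / 2

if __name__ == "__main__":
    # Stand-in judges for demonstration; a real panel would wrap LLM API calls
    # around JUDGE_PROMPT and possibly weight votes by judge reliability.
    lenient = lambda q, r, c: "correct" if c.strip() else "incorrect"
    strict = lambda q, r, c: "correct" if c.strip() == r.strip() else "incorrect"
    print(panel_verdict("Compute 2+2.", "4", "4", [lenient, strict, strict]))  # True
```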
Related papers
- Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration [63.61423859450929]
This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses. We identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery.
arXiv Detail & Related papers (2026-01-20T18:46:42Z)
- HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z)
- CLINB: A Climate Intelligence Benchmark for Foundational Models [31.884362929125363]
We introduce CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks. We implement and validate a model-based evaluation process and evaluate several frontier models.
arXiv Detail & Related papers (2025-10-29T16:15:42Z)
- Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark [49.42250115889234]
We present the first benchmark designed to test large language models (LLMs) on research-level reasoning tasks in physics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges.
arXiv Detail & Related papers (2025-09-30T17:34:03Z)
- The Ever-Evolving Science Exam [69.20851050366643]
We introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, and 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations (an illustrative sampling sketch follows the related-papers list below).
arXiv Detail & Related papers (2025-07-22T12:22:16Z)
- AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science [9.757412158154514]
We present a novel benchmark designed to assess large language models (LLMs) across five core categories of atmospheric science problems. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs). We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models.
arXiv Detail & Related papers (2025-02-03T08:50:46Z)
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z)
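As referenced in the EESE entry above, the following is a minimal, hypothetical sketch of a periodically refreshed, leakage-resilient evaluation subset drawn from a larger private pool. The pool schema, subset size, and refresh policy here are assumptions for illustration only and are not taken from the EESE paper.

```python
# Hypothetical illustration of a periodically refreshed evaluation subset:
# draw a fixed-size, discipline-stratified sample from a large private pool,
# re-seeded per release period so each refresh is reproducible but differs
# from the previous one.
import random
from collections import defaultdict
from typing import Dict, List

def periodic_subset(pool: List[Dict], subset_size: int, period_tag: str) -> List[Dict]:
    """Sample `subset_size` items, spread across disciplines, seeded by `period_tag`."""
    rng = random.Random(period_tag)            # e.g. "2025-Q3" -> deterministic refresh
    by_discipline: Dict[str, List[Dict]] = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)
    per_bucket = max(1, subset_size // len(by_discipline))
    sample: List[Dict] = []
    for items in by_discipline.values():
        sample.extend(rng.sample(items, min(per_bucket, len(items))))
    rng.shuffle(sample)
    return sample[:subset_size]

if __name__ == "__main__":
    # Toy pool with an assumed {"discipline", "question"} schema.
    pool = [{"discipline": d, "question": f"{d} Q{i}"}
            for d in ("math", "physics", "biology") for i in range(50)]
    print(len(periodic_subset(pool, 6, "2025-Q3")))  # 6
```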