SAIBench: Benchmarking AI for Science
- URL: http://arxiv.org/abs/2206.05418v1
- Date: Sat, 11 Jun 2022 04:19:51 GMT
- Title: SAIBench: Benchmarking AI for Science
- Authors: Yatao Li, Jianfeng Zhan
- Abstract summary: We formalize the problem of scientific AI benchmarking, and propose a system called SAIBench.
The system approaches this goal with SAIL, a domain-specific language to decouple research problems, AI models, ranking criteria, and software/hardware configuration into reusable modules.
- Score: 3.2724772895344314
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Scientific research communities are embracing AI-based solutions to target
tractable scientific tasks and improve research workflows. However, the
development and evaluation of such solutions are scattered across multiple
disciplines. We formalize the problem of scientific AI benchmarking, and
propose a system called SAIBench in the hope of unifying the efforts and
enabling low-friction on-boarding of new disciplines. The system approaches
this goal with SAIL, a domain-specific language to decouple research problems,
AI models, ranking criteria, and software/hardware configuration into reusable
modules. We show that this approach is flexible and can adapt to problems, AI
models, and evaluation methods defined in different perspectives. The project
homepage is https://www.computercouncil.org/SAIBench
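The abstract does not show SAIL's concrete syntax, but the decoupling it describes can be pictured with a minimal Python sketch. Everything below (the module names, the `run_benchmark` helper, the MAE criterion) is a hypothetical illustration of the idea, not SAIBench's actual API:

```python
# Hypothetical sketch of SAIL's decoupling (illustrative; not SAIBench's API).
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Problem:                       # a research problem with its dataset
    name: str
    inputs: Sequence[float]
    targets: Sequence[float]

@dataclass
class HardwareConfig:                # software/hardware configuration module
    device: str = "cpu"
    precision: str = "fp64"

Model = Callable[[float], float]     # an AI model is any prediction callable

def mae(preds: Sequence[float], targets: Sequence[float]) -> float:
    """One possible ranking criterion: mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def run_benchmark(problem: Problem, model: Model,
                  criterion: Callable, hw: HardwareConfig) -> float:
    """Compose the four modules; each can be swapped independently."""
    preds = [model(x) for x in problem.inputs]
    print(f"{problem.name} on {hw.device}/{hw.precision}")
    return criterion(preds, problem.targets)

problem = Problem("toy-regression", [0.0, 1.0, 2.0], [0.1, 0.9, 2.2])
score = run_benchmark(problem, lambda x: x, mae, HardwareConfig())
print(f"MAE: {score:.3f}")
```

The point of the decoupling is that replacing the model, the ranking criterion, or the hardware description changes one module without touching the others.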
Related papers
- FrontierCS: Evolving Challenges for Evolving Intelligence [174.80075821079708]
We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science. For each problem we provide an expert reference solution and an automatic evaluator. We find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks.
arXiv Detail & Related papers (2025-12-17T18:52:45Z) - Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems. We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
arXiv Detail & Related papers (2025-12-16T18:51:23Z) - A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [56.40989626804489]
This survey provides the first holistic analysis of Large Language Model-powered software engineering. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair.
arXiv Detail & Related papers (2025-10-10T06:56:50Z) - Barbarians at the Gate: How AI is Upending Systems Research [58.95406995634148]
We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery. We term this approach AI-Driven Research for Systems (ADRS), which iteratively generates, evaluates, and refines solutions. Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.
arXiv Detail & Related papers (2025-10-07T17:49:24Z) - An AI system to help scientists write expert-level empirical software [25.01900335784437]
We present an AI system that creates expert-level scientific software to maximize a quality metric. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations.
arXiv Detail & Related papers (2025-09-08T10:08:36Z) - ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry [22.615102398311432]
We introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of deep AI research systems. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios. OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions.
arXiv Detail & Related papers (2025-07-22T06:51:26Z) - ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows [82.07367406991678]
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
arXiv Detail & Related papers (2025-05-26T12:27:27Z) - MLGym: A New Framework and Benchmark for Advancing AI Research Agents [51.9387884953294]
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing large language models on AI research tasks.
This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents.
We evaluate a number of frontier large language models (LLMs), such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro, on our benchmark.
arXiv Detail & Related papers (2025-02-20T12:28:23Z) - Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation [58.064940977804596]
A plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently.
Ethical concerns regarding the shortcomings of these tools and their potential for misuse take a particularly prominent place in our discussion.
arXiv Detail & Related papers (2025-02-07T18:26:45Z) - OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities.
These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage.
Our evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z) - SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving [64.38649623473626]
Large Language Models (LLMs) have driven substantial progress in artificial intelligence.
We propose a novel framework called SEquential subGoal Optimization (SEGO) to enhance LLMs' ability to solve mathematical problems.
arXiv Detail & Related papers (2023-10-19T17:56:40Z) - Does AI for science need another ImageNet or totally different benchmarks? A case study of machine learning force fields [5.622820801789953]
AI for science (AI4S) aims to enhance the accuracy and speed of scientific computing tasks using machine learning methods.
Traditional AI benchmarking methods struggle to adapt to the unique challenges posed by AI4S because they assume data in training, testing, and future real-world queries are independent and identically distributed.
This paper investigates the need for a novel approach to effectively benchmark AI for science, using the machine learning force field (MLFF) as a case study.
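To make the i.i.d. concern concrete, here is a small self-contained sketch (my illustration, not the paper's code): a toy surrogate is fitted under a random split and under a distribution-shifted split, where it is trained on small inputs and queried on larger ones, much as an MLFF trained on small structures is queried on bigger systems. The error under shift is typically much larger:

```python
# Illustrative only (not the paper's code): random vs. shifted evaluation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 500)        # stand-in for, e.g., structure size
y = np.sin(x) + 0.1 * x**2             # stand-in potential-energy surface

def fit_and_eval(train_mask: np.ndarray) -> float:
    """Fit a cubic on the training portion, report MAE on the rest."""
    coeffs = np.polyfit(x[train_mask], y[train_mask], deg=3)
    preds = np.polyval(coeffs, x[~train_mask])
    return float(np.mean(np.abs(preds - y[~train_mask])))

iid_mask = rng.random(x.size) < 0.8    # random 80/20 split (i.i.d.)
shift_mask = x < 7.0                   # train on small x, test on large x only

print(f"i.i.d. split MAE:        {fit_and_eval(iid_mask):.3f}")
print(f"extrapolation split MAE: {fit_and_eval(shift_mask):.3f}")
```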
arXiv Detail & Related papers (2023-08-11T08:06:58Z) - Bridging MDE and AI: A Systematic Review of Domain-Specific Languages and Model-Driven Practices in AI Software Systems Engineering [1.4853133497896698]
This study investigates existing model-driven approaches that rely on DSLs to support the engineering of AI software systems.
The use of MDE for AI is still in its early stages, and there is no single tool or method that is widely used.
arXiv Detail & Related papers (2023-07-10T14:38:38Z) - OpenAGI: When LLM Meets Domain Experts [51.86179657467822]
Human Intelligence (HI) excels at combining basic skills to solve complex tasks.
This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents.
We introduce OpenAGI, an open-source platform designed for solving multi-step, real-world tasks.
arXiv Detail & Related papers (2023-04-10T03:55:35Z) - AI for Science: An Emerging Agenda [30.260160661295682]
This report documents the programme and the outcomes of Dagstuhl Seminar 22382, "Machine Learning for Science: Bridging Data-Driven and Mechanistic Modelling".
The transformative potential of AI stems from its widespread applicability across disciplines, and will only be achieved through integration across research domains.
Alongside technical advances, the next wave of progress in the field will come from building a community of machine learning researchers, domain experts, citizen scientists, and engineers.
arXiv Detail & Related papers (2023-03-07T20:21:43Z) - GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision.
We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
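The summary does not say how the constraints enter the models; one common pattern, assumed here purely for illustration (it is not necessarily what GLUECons implements), is to compile a piece of external knowledge into a soft penalty added to the task loss:

```python
# Illustrative only: external knowledge compiled into a soft penalty.
import numpy as np

def task_loss(p: float, y: int) -> float:
    """Binary cross-entropy for one predicted probability."""
    return -float(y * np.log(p) + (1 - y) * np.log(1 - p))

def implication_penalty(p_bird: float, p_can_fly: float) -> float:
    """Soft form of the (hypothetical) rule 'bird implies can_fly':
    nonzero whenever P(bird) exceeds P(can_fly)."""
    return max(0.0, p_bird - p_can_fly)

p_bird, p_can_fly = 0.9, 0.4           # two independent sigmoid outputs
lam = 0.5                              # weight of the constraint term
loss = (task_loss(p_bird, 1) + task_loss(p_can_fly, 1)
        + lam * implication_penalty(p_bird, p_can_fly))
print(f"task + constraint loss: {loss:.3f}")
```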
arXiv Detail & Related papers (2023-02-16T16:45:36Z) - SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems [18.699431277588637]
We propose a scalable evaluation methodology (SAIH) for analyzing the AI performance trend of HPC systems.
As the data and model are scaled up, we can investigate the trend and range of AI performance on HPC systems.
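As a rough sketch of what such a scaling sweep can look like (illustrative only; the `train_step` workload and the scale factors are stand-ins, not SAIH's actual methodology):

```python
# Illustrative only: grow model and data together, record the trend.
import time

def train_step(model_params: int, dataset_size: int) -> float:
    """Stand-in workload whose cost grows with both scales."""
    acc = 0.0
    for i in range(model_params * dataset_size // 1000):
        acc += (i % 7) * 1e-9
    return acc

for scale in (1, 2, 4, 8):
    params, data = 10_000 * scale, 1_000 * scale
    start = time.perf_counter()
    train_step(params, data)
    elapsed = time.perf_counter() - start
    print(f"scale x{scale}: params={params:,} data={data:,} "
          f"time={elapsed:.4f}s throughput={data / elapsed:,.0f} samples/s")
```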
arXiv Detail & Related papers (2022-12-07T02:42:29Z) - An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based methods and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z) - Software Engineering for AI-Based Systems: A Survey [8.550158373713906]
There is limited synthesized knowledge on Software Engineering approaches for building, operating, and maintaining AI-based systems.
SE for AI-based systems is an emerging research area; more than two-thirds of the studies have been published since 2018.
The most studied properties of AI-based systems are dependability and safety.
arXiv Detail & Related papers (2021-05-05T11:22:08Z) - Artificial Intelligence for IT Operations (AIOPS) Workshop White Paper [50.25428141435537]
Artificial Intelligence for IT Operations (AIOps) is an emerging interdisciplinary field arising in the intersection between machine learning, big data, streaming analytics, and the management of IT operations.
The main aim of the AIOps workshop is to bring together researchers from both academia and industry to present their experiences, results, and work in progress in this field.
arXiv Detail & Related papers (2021-01-15T10:43:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.