Related papers: CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

URL: http://arxiv.org/abs/2509.20374v2
Date: Fri, 10 Oct 2025 15:05:13 GMT
Title: CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
Authors: Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan,
Abstract summary: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical system remains underexplored.<n>We introduce CFDLLMBench, a benchmark suite designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD.
Score: 13.16419723805434
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical system -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.

Related papers

CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants [2.2811622267552014]
Large Language Models (LLM) are increasingly used for software development.<n>Existing benchmarks for LLM-based coding assistance do not reflect the constraints of High Energy Physics and High Performance Computing software.<n>This paper develops practical, repeatable benchmarks that quantify LLM performance on HEP/ HPC-relevant tasks.
arXiv Detail & Related papers (2026-03-01T11:16:50Z)
An Agentic Framework for Autonomous Materials Computation [70.24472585135929]
Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery.<n>Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific experiments.<n>Here, we present a domain-specialized agent designed for reliable automation of first-principles materials computations.
arXiv Detail & Related papers (2025-12-22T15:03:57Z)
CFD-copilot: leveraging domain-adapted large language model and model context protocol to enhance simulation automation [6.71937346130764]
CFD-copilot is a framework designed to facilitate natural language-driven CFD simulation from setup to post-processing.<n>For post-processing, the framework utilizes the model context protocol (MCP), an open standard that decouples LLM reasoning from external tool execution.<n>The framework was evaluated on benchmarks including the NACA0012 airfoil and the three-element 30P-30N airfoil.
arXiv Detail & Related papers (2025-12-08T11:42:32Z)
Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification [21.987735608080374]
DafnyCOMP is a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny.<n>We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks.
arXiv Detail & Related papers (2025-09-27T02:33:08Z)
ChatCFD: An LLM-Driven Agent for End-to-End CFD Automation with Domain-Specific Structured Reasoning [4.098524616768554]
ChatCFD is an automated agent system for OpenFOAM simulations.<n>Its four-stage pipeline enables iterative trial-reflection-refinement for intricate setups.<n>ChatCFD shows strong potential as a modular component in MCP-based agent networks for collaborative multi-agent systems.
arXiv Detail & Related papers (2025-05-28T08:43:49Z)
SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation [5.880496520248658]
SIMCOPILOT is a benchmark that simulates the role of large language models (LLMs) as interactive, "copilot"-style coding assistants.<n>The benchmark comprises dedicated sub-benchmarks for Java (SIMCOPILOTJ) and Python.
arXiv Detail & Related papers (2025-05-21T04:59:44Z)
Computational Reasoning of Large Language Models [51.629694188014064]
We introduce textbfTuring Machine Bench, a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes.<n> TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machine.
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLM) have shown strong capabilities in general visual reasoning tasks.<n>We develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS) based on an MLLM.<n>MAPS decomposes expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
MetaOpenFOAM: an LLM-based multi-agent framework for CFD [11.508919041921942]
MetaOpenFOAM is a novel multi-agent collaborations framework. It aims to complete CFD simulation tasks with only natural language as input. It harnesses the power of MetaGPT's assembly line paradigm.
arXiv Detail & Related papers (2024-07-31T04:01:08Z)
FLUID-LLM: Learning Computational Fluid Dynamics with Spatiotemporal-aware Large Language Models [15.964726158869777]
Large language models (LLMs) have shown remarkable pattern recognition and reasoning abilities. We introduce FLUID-LLM, a novel framework combining pre-trained LLMs with pre-aware encoding to predict unsteady fluid dynamics. Our results demonstrate that FLUID-LLM effectively integratestemporal information into pre-trained LLMs, enhancing CFD task performance.
arXiv Detail & Related papers (2024-06-06T20:55:40Z)
UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs [74.1976921342982]
This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency. The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow.
arXiv Detail & Related papers (2024-04-11T09:17:12Z)
In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations. As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks. This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols. We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.