CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
- URL: http://arxiv.org/abs/2509.20374v2
- Date: Fri, 10 Oct 2025 15:05:13 GMT
- Title: CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
- Authors: Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan,
- Abstract summary: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems remains underexplored. We introduce CFDLLMBench, a benchmark suite designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD.
- Score: 13.16419723805434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical systems -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
Related papers
- CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants [2.2811622267552014]
Large Language Models (LLMs) are increasingly used for software development. Existing benchmarks for LLM-based coding assistance do not reflect the constraints of High Energy Physics and High Performance Computing software. This paper develops practical, repeatable benchmarks that quantify LLM performance on HEP/HPC-relevant tasks.
arXiv Detail & Related papers (2026-03-01T11:16:50Z) - An Agentic Framework for Autonomous Materials Computation [70.24472585135929]
Large Language Models (LLMs) have emerged as powerful tools for accelerating scientific discovery. Recent advances integrate LLMs into agentic frameworks, enabling retrieval, reasoning, and tool use for complex scientific experiments. Here, we present a domain-specialized agent designed for reliable automation of first-principles materials computations.
arXiv Detail & Related papers (2025-12-22T15:03:57Z) - CFD-copilot: leveraging domain-adapted large language model and model context protocol to enhance simulation automation [6.71937346130764]
CFD-copilot is a framework designed to facilitate natural language-driven CFD simulation from setup to post-processing. For post-processing, the framework utilizes the model context protocol (MCP), an open standard that decouples LLM reasoning from external tool execution. The framework was evaluated on benchmarks including the NACA0012 airfoil and the three-element 30P-30N airfoil.
arXiv Detail & Related papers (2025-12-08T11:42:32Z) - Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification [21.987735608080374]
DafnyCOMP is a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks.
arXiv Detail & Related papers (2025-09-27T02:33:08Z) - ChatCFD: An LLM-Driven Agent for End-to-End CFD Automation with Domain-Specific Structured Reasoning [4.098524616768554]
ChatCFD is an automated agent system for OpenFOAM simulations. Its four-stage pipeline enables iterative trial-reflection-refinement for intricate setups. ChatCFD shows strong potential as a modular component in MCP-based agent networks for collaborative multi-agent systems.
arXiv Detail & Related papers (2025-05-28T08:43:49Z) - SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation [5.880496520248658]
SIMCOPILOT is a benchmark that simulates the role of large language models (LLMs) as interactive, "copilot"-style coding assistants. The benchmark comprises dedicated sub-benchmarks for Java (SIMCOPILOTJ) and Python.
arXiv Detail & Related papers (2025-05-21T04:59:44Z) - Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
arXiv Detail & Related papers (2025-04-29T13:52:47Z) - MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLMs) have shown strong capabilities in general visual reasoning tasks. We develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS), based on an MLLM. MAPS decomposes the expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - MetaOpenFOAM: an LLM-based multi-agent framework for CFD [11.508919041921942]
MetaOpenFOAM is a novel multi-agent collaboration framework.
It aims to complete CFD simulation tasks with only natural language as input.
It harnesses the power of MetaGPT's assembly line paradigm.
arXiv Detail & Related papers (2024-07-31T04:01:08Z) - FLUID-LLM: Learning Computational Fluid Dynamics with Spatiotemporal-aware Large Language Models [15.964726158869777]
Large language models (LLMs) have shown remarkable pattern recognition and reasoning abilities.
We introduce FLUID-LLM, a novel framework combining pre-trained LLMs with spatiotemporal-aware encoding to predict unsteady fluid dynamics.
Our results demonstrate that FLUID-LLM effectively integrates spatiotemporal information into pre-trained LLMs, enhancing CFD task performance.
arXiv Detail & Related papers (2024-06-06T20:55:40Z) - UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs [74.1976921342982]
This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency.
The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow.
arXiv Detail & Related papers (2024-04-11T09:17:12Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.