Related papers: BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction

URL: http://arxiv.org/abs/2510.16559v3
Date: Fri, 31 Oct 2025 05:31:37 GMT
Title: BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction
Authors: Tian Xia, Tianrun Gao, Wenhao Deng, Long Wei, Xiaowei Qian, Yixian Jiang, Chenglei Yu, Tailin Wu
Abstract summary: BuildArena is the first physics-aligned interactive benchmark designed for language-driven engineering construction. It contributes to the community in four aspects: (1) a highly customizable benchmarking framework for in-depth comparison and analysis of LLMs; (2) an extendable task design strategy spanning static and dynamic mechanics across multiple difficulty tiers; (3) a 3D Spatial Geometric Computation Library for supporting construction based on language instructions; and (4) a baseline LLM agentic workflow that effectively evaluates diverse model capabilities.
Score: 11.450127891454267
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated. To address this gap, we introduce BuildArena, the first physics-aligned interactive benchmark designed for language-driven engineering construction. It contributes to the community in four aspects: (1) a highly customizable benchmarking framework for in-depth comparison and analysis of LLMs; (2) an extendable task design strategy spanning static and dynamic mechanics across multiple difficulty tiers; (3) a 3D Spatial Geometric Computation Library for supporting construction based on language instructions; (4) a baseline LLM agentic workflow that effectively evaluates diverse model capabilities. On eight frontier LLMs, BuildArena comprehensively evaluates their capabilities for language-driven and physics-grounded construction automation. The project page is at https://build-arena.github.io/.

Related papers

RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis [78.32151470154422]
We introduce RAVEL, an agentic framework that enables the testers to autonomously plan and execute typical synthesis operations. We present C3EBench, a benchmark comprising 1,258 samples derived from professional human writings. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability.
arXiv Detail & Related papers (2026-02-28T14:47:34Z)
Beyond Basic Specifications? A Systematic Study of Logical Constructs in LLM-based Specification Generation [29.231420590756954]
The use of large language models (LLMs) for the automatic generation of program specifications has emerged as a promising avenue for enhancing verification efficiency. We propose incorporating logical constructs into an existing LLM-based specification generation framework. We conduct an empirical study aimed at exploring the impact of various types of syntactic constructs on the specification generation framework.
arXiv Detail & Related papers (2026-01-31T13:19:40Z)
Large Language Model Agent for User-friendly Chemical Process Simulations [0.0]
A large language model (LLM) agent is integrated with AVEVA Process Model Protocol (MCP), allowing natural language simulations. Two case studies assess the framework across different task complexities and interaction modes. The framework benefits both educational purposes, by translating technical concepts and demonstrating, and experienced practitioners by automating data extraction, speeding routine tasks, and supporting. While current limitations such as oversimplification, calculation errors, and technical hiccups mean expert oversight is still needed, the framework suggests LLM-based agents can become valuable collaborators.
arXiv Detail & Related papers (2026-01-15T12:18:45Z)
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models [78.73992315826035]
We introduce Youtu-LLM, a lightweight language model that harmonizes high computational efficiency with native agentic intelligence. Youtu-LLM is pre-trained from scratch to systematically cultivate reasoning and planning capabilities.
arXiv Detail & Related papers (2025-12-31T04:25:11Z)
HELP: Hierarchical Embodied Language Planner for Household Tasks [75.38606213726906]
Embodied agents tasked with complex scenarios rely heavily on robust planning capabilities. Large language models equipped with extensive linguistic knowledge can play this role. We propose a Hierarchical Embodied Language Planner, called HELP, consisting of a set of LLM-based agents.
arXiv Detail & Related papers (2025-12-25T15:54:08Z)
Agentic Design of Compositional Machines [26.167638081496914]
We investigate whether large language models (LLMs) can learn to create machines. We use BesiegeField, a testbed built on the machine-building game Besiege. We benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success.
arXiv Detail & Related papers (2025-10-16T17:59:58Z)
Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z)
BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks. Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z)
Integrating Large Language Models for Automated Structural Analysis [0.7373617024876725]
We propose a framework that integrates Large Language Models (LLMs) with structural analysis software. LLMs parse structural descriptions from text and translate them into Python scripts. It employs domain-specific prompt design and in-context learning strategies to enhance the LLM's problem-solving capabilities and generative stability.
arXiv Detail & Related papers (2025-04-13T23:10:33Z)
FEABench: Evaluating Language Models on Multiphysics Reasoning Ability [8.441945838936444]
We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics®, an FEA software, to compute the answers.
arXiv Detail & Related papers (2025-04-08T17:59:39Z)
Specifications: The missing link to making the development of LLM systems an engineering discipline [65.10077876035417]
We discuss the progress the field has made so far, through advances like structured outputs, process supervision, and test-time compute. We outline several future directions for research to enable the development of modular and reliable LLM-based systems.
arXiv Detail & Related papers (2024-11-25T07:48:31Z)
Configurable Foundation Models: Building LLMs from a Modular Perspective [115.63847606634268]
A growing tendency to decompose LLMs into numerous functional modules allows for inference with a subset of modules and dynamic assembly of modules to tackle complex tasks. We coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. We present four brick-oriented operations: retrieval and routing, merging, updating, and growing. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions.
arXiv Detail & Related papers (2024-09-04T17:01:02Z)
LLM4EDA: Emerging Progress in Large Language Models for Electronic Design Automation [74.7163199054881]
Large Language Models (LLMs) have demonstrated their capability in context understanding, logic reasoning and answer generation. We present a systematic study on the application of LLMs in the EDA field. We highlight the future research direction, focusing on applying LLMs in logic synthesis, physical design, multi-modal feature extraction and alignment of circuits.
arXiv Detail & Related papers (2023-12-28T15:09:14Z)
CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [74.22729793816451]
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability. We propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization. We evaluate CREATOR on MATH and TabMWP benchmarks, respectively consisting of challenging math competition problems.
arXiv Detail & Related papers (2023-05-23T17:51:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.