Related papers: CEQuest: Benchmarking Large Language Models for Construction Estimation

URL: http://arxiv.org/abs/2508.16081v1
Date: Fri, 22 Aug 2025 04:14:20 GMT
Title: CEQuest: Benchmarking Large Language Models for Construction Estimation
Authors: Yanzhao Wu, Lufan Wang, Rui Liu
Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. We introduce CEQuest, a novel benchmark dataset designed to evaluate the performance of LLMs in answering construction-related questions.
Score: 3.929359686281298
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.

Related papers

Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents [52.14392337070763]
We introduce CFG-Bench, a new benchmark designed to systematically evaluate fine-grained action intelligence. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions.
arXiv Detail & Related papers (2025-11-24T02:02:29Z)
AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field [12.465017512854475]
Large language models (LLMs) are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. This paper establishes AECBench, a benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework.
arXiv Detail & Related papers (2025-09-23T08:09:58Z)
From Parameters to Performance: A Data-Driven Study on LLM Structure and Development [73.67759647072519]
Large language models (LLMs) have achieved remarkable success across various domains. Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce. We present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks.
arXiv Detail & Related papers (2025-09-14T12:20:39Z)
Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey [49.1574468325115]
We conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations. We provide detailed overviews within each category and highlight challenges in this field. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
arXiv Detail & Related papers (2025-05-21T19:17:29Z)
Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z)
Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models [0.0]
This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems. It focuses on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain.
arXiv Detail & Related papers (2024-10-24T00:49:46Z)
A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence. This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z)
Mining experimental data from Materials Science literature with Large Language Models: an evaluation study [1.9849264945671101]
This study is dedicated to assessing the capabilities of large language models (LLMs) in extracting structured information from scientific documents in materials science. We focus on two critical tasks of information extraction: (i) named entity recognition (NER) of studied materials and physical properties and (ii) relation extraction (RE) between these entities. The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline).
arXiv Detail & Related papers (2024-01-19T23:00:31Z)
Knowledge Plugins: Enhancing Large Language Models for Domain-Specific Recommendations [50.81844184210381]
We propose a general paradigm that augments large language models with DOmain-specific KnowledgE to enhance their performance on practical applications, namely DOKE. This paradigm relies on a domain knowledge extractor, working in three steps: 1) preparing effective knowledge for the task; 2) selecting the knowledge for each specific sample; and 3) expressing the knowledge in an LLM-understandable way.
arXiv Detail & Related papers (2023-11-16T07:09:38Z)
L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs). We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. Results show that popular n-gram matching metrics generally do not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities [66.36633042421387]
Large Language Models (LLMs) are evaluated for Knowledge Graph (KG) construction and reasoning. We propose AutoKG, a multi-agent-based approach employing LLMs and external sources for KG construction and reasoning.
arXiv Detail & Related papers (2023-05-22T15:56:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.