Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams
- URL: http://arxiv.org/abs/2504.08779v1
- Date: Fri, 04 Apr 2025 18:13:45 GMT
- Title: Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams
- Authors: Ruoxin Xiong, Yanyu Wang, Suat Gunhan, Yimin Zhu, Charles Berryman
- Abstract summary: This study introduces CMExamSet, a benchmarking dataset comprising 689 authentic multiple-choice questions from four nationally accredited CM certification exams. Results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Conceptual misunderstandings are the most common error type, underscoring the need for enhanced domain-specific reasoning models.
- Score: 2.897171041611256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflow and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from four nationally accredited CM certification exams. Our zero-shot evaluation assesses overall accuracy, subject areas (e.g., construction safety), reasoning complexity (single-step and multi-step), and question formats (text-only, figure-referenced, and table-referenced). The results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Additionally, both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Furthermore, both LLMs show significant limitations on figure-referenced questions, with accuracies dropping to approximately 40%. Our error pattern analysis further reveals that conceptual misunderstandings are the most common (44.4% and 47.9%), underscoring the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.
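To make the evaluation protocol concrete, the following is a minimal Python sketch of the zero-shot pipeline described in the abstract: each multiple-choice item is turned into a prompt, the model's reply is parsed for an option letter, and accuracy is aggregated overall and by subject area, reasoning complexity, and question format. The record fields, file name, and `query_model` stub are illustrative assumptions, not details taken from CMExamSet or the paper's code.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation, assuming hypothetical
# CMExamSet-style records with "question", "options", "answer", "subject",
# "reasoning" (single-step / multi-step), and "format" (text-only / figure /
# table) fields. query_model() is a stand-in for a real GPT-4o or Claude 3.7 call.
import json
import re
from collections import defaultdict

def build_prompt(item: dict) -> str:
    options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    return (
        "Answer the following construction management exam question.\n"
        f"{item['question']}\n{options}\n"
        "Respond with the letter of the single best option."
    )

def query_model(prompt: str) -> str:
    # Stand-in so the sketch runs end to end; replace with an actual LLM API call.
    return "A"

def extract_choice(response: str) -> str:
    # Take the first standalone option letter in the model's reply.
    match = re.search(r"\b([A-D])\b", response.upper())
    return match.group(1) if match else ""

def evaluate(items: list[dict]) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        hit = int(extract_choice(query_model(build_prompt(item))) == item["answer"])
        # Aggregate overall and by the breakdowns reported in the paper.
        for key in ("overall", f"subject:{item['subject']}",
                    f"reasoning:{item['reasoning']}", f"format:{item['format']}"):
            correct[key] += hit
            total[key] += 1
    return {key: correct[key] / total[key] for key in total}

if __name__ == "__main__":
    with open("cmexamset.json") as f:  # hypothetical file name
        print(json.dumps(evaluate(json.load(f)), indent=2))
```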
Related papers
- Code Generation with Small Language Models: A Deep Evaluation on Codeforces [2.314213846671956]
Small Language Models offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks. We benchmark five open SLMs across 280 Codeforces problems spanning Elo ratings from 800 to 2100. PHI-4 14B achieved the best performance among SLMs, with a pass@3 of 63.6%.
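For context on the pass@3 figure quoted above: pass@k is commonly computed with the unbiased estimator pass@k = 1 - C(n-c, k)/C(n, k), where n solutions are sampled per problem and c of them pass the tests. Whether the cited paper uses this exact estimator is an assumption; the sketch below only illustrates the metric.

```python
# Illustrative pass@k computation (assumed, not taken from the cited paper).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them passed, k drawn."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 samples per problem, exactly one passes -> pass@3 = 1.0
print(pass_at_k(n=3, c=1, k=3))
```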
arXiv Detail & Related papers (2025-04-09T23:57:44Z) - MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers [10.311462547308823]
This work presents MMCR, a benchmark designed to evaluate Vision-Language Models' capacity for reasoning with cross-source information from scientific papers. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models.
arXiv Detail & Related papers (2025-03-21T05:02:20Z) - Why Do Multi-Agent LLM Systems Fail? [91.39266556855513]
We present MAST (Multi-Agent System Failure taxonomy), the first empirically grounded taxonomy designed to understand MAS failures.
We analyze seven popular MAS frameworks across over 200 tasks, involving six expert human annotators.
We identify 14 unique failure modes, organized into 3 overarching categories: (i) specification issues, (ii) inter-agent misalignment, and (iii) task verification.
arXiv Detail & Related papers (2025-03-17T19:04:38Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data; these findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises [41.39610589639382]
We present RuozhiBench, a dataset of 677 carefully curated questions featuring various forms of deceptive reasoning. We evaluate 17 large language models (LLMs) from 5 series on RuozhiBench using both open-ended and two-choice formats. LLMs showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human accuracy of more than 90%.
arXiv Detail & Related papers (2025-02-18T18:47:11Z) - One for All: A General Framework of LLMs-based Multi-Criteria Decision Making on Human Expert Level [7.755152930120769]
We propose an evaluation framework to automatically handle general complex MCDM problems. Within the framework, we assess the performance of various typical open-source models, as well as commercial models such as Claude and ChatGPT. The experimental results show that accuracy rates across different applications improve significantly to around 95%, and the performance difference between models is trivial.
arXiv Detail & Related papers (2025-02-17T06:47:20Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - Responsible AI in Construction Safety: Systematic Evaluation of Large Language Models and Prompt Engineering [9.559203170987598]
Construction remains one of the most hazardous sectors.
Recent advancements in AI, particularly Large Language Models (LLMs), offer promising opportunities for enhancing workplace safety.
This study evaluates the performance of two widely used LLMs, GPT-3.5 and GPT-4o, across three standardized exams administered by the Board of Certified Safety Professionals (BCSP).
arXiv Detail & Related papers (2024-11-13T04:06:09Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 subproblems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web [69.6913064185993]
Language model agents (LMAs) have emerged as a promising paradigm for multi-step decision-making tasks. Despite the promise, their performance on real-world applications is still underexplored. We show that while existing LMAs achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks.
arXiv Detail & Related papers (2023-11-30T17:50:47Z)