Evaluating Large Language Models for Real-World Engineering Tasks
- URL: http://arxiv.org/abs/2505.13484v1
- Date: Mon, 12 May 2025 14:05:23 GMT
- Title: Evaluating Large Language Models for Real-World Engineering Tasks
- Authors: Rene Heesch, Sebastian Eilermann, Alexander Windmann, Alexander Diedrich, Philipp Rosenthal, Oliver Niggemann
- Abstract summary: This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
- Score: 75.97299249823972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are transformative not only for daily activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) the reliance on simplified use cases, often adapted from examination materials where correctness is easily verifiable, and (ii) the use of ad hoc scenarios that insufficiently capture critical engineering competencies. Consequently, the assessment of LLMs on complex, real-world engineering problems remains largely unexplored. This paper addresses this gap by introducing a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios, systematically designed to cover core competencies such as product design, prognosis, and diagnosis. Using this dataset, we evaluate four state-of-the-art LLMs, including both cloud-based and locally hosted instances, to systematically investigate their performance on complex engineering tasks. Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
Related papers
- Computational Thinking Reasoning in Large Language Models [69.28428524878885]
Computational Thinking Model (CTM) is a novel framework that incorporates computational thinking paradigms into large language models (LLMs). Live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing. CTM outperforms conventional reasoning models and tool-augmented baselines in terms of accuracy, interpretability, and generalizability.
arXiv Detail & Related papers (2025-06-03T09:11:15Z)
- Practical Considerations for Agentic LLM Systems [5.455744338342196]
This paper frames actionable insights and considerations from the research community in the context of established application paradigms. Namely, we position relevant research findings into four broad categories (Planning, Memory, Tools, and Control Flow) based on common practices in application-focused literature.
arXiv Detail & Related papers (2024-12-05T11:57:49Z)
- EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark [10.265704144939503]
Large language models (LLMs) and large multimodal models (LMMs) have demonstrated promising skills in various domains including science and mathematics. We propose EEE-Bench, a multimodal benchmark aimed at assessing LMMs' capabilities in solving practical engineering tasks. Our benchmark consists of 2860 carefully curated problems spanning 10 essential subdomains such as analog circuits and control systems.
arXiv Detail & Related papers (2024-11-03T09:17:56Z)
- Language Agents Meet Causality -- Bridging LLMs and Causal World Models [50.79984529172807]
We propose a framework that integrates causal representation learning with large language models.
This framework learns a causal world model, with causal variables linked to natural language expressions.
We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities.
arXiv Detail & Related papers (2024-10-25T18:36:37Z)
- CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge [44.59258397967782]
Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing tasks.
We present a systematic evaluation of state-of-the-art LLMs' complex logical reasoning abilities.
We find that LLMs excel at reasoning over general world knowledge but face significant challenges with specialized domain-specific knowledge.
arXiv Detail & Related papers (2024-07-30T05:40:32Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
- A Case Study on Test Case Construction with Large Language Models: Unveiling Practical Insights and Challenges [2.7029792239733914]
This paper examines the application of Large Language Models in the construction of test cases within the context of software engineering.
Through a blend of qualitative and quantitative analyses, this study assesses the impact of LLMs on test case comprehensiveness, accuracy, and efficiency.
arXiv Detail & Related papers (2023-12-19T20:59:02Z)
- Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning techniques to assess model performance.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)