Related papers: A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

URL: http://arxiv.org/abs/2509.06813v1
Date: Mon, 08 Sep 2025 15:48:17 GMT
Title: A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs
Authors: Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya,
Abstract summary: This paper presents a framework for benchmarking Large Language Models (LLMs) on the task of classifying complex industrial records.<n>To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool.<n>We quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effective Operation and Maintenance (O&M) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task's semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing O&M data quality and downstream reliability analysis.

Related papers

A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms [20.241519889633285]
Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms play a critical role.<n>We conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS.<n>We introduce MIMeBench, a new open-ended benchmark that targets two foundational yet underexplored semantic capabilities.
arXiv Detail & Related papers (2026-01-19T17:23:45Z)
Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs [21.656551146954587]
Large Language Models (LLMs) offer a path to automation.<n>We introduce a novel, structured dataset from 190 corporate reports.<n>Our results reveal a clear performance gap between qualitative and quantitative tasks.
arXiv Detail & Related papers (2025-12-30T15:28:03Z)
Smart but Costly? Benchmarking LLMs on Functional Accuracy and Energy Efficiency [5.771786260272727]
We present a framework, BRACE, to benchmark Code Language Models on a unified scale of energy efficiency and functional correctness.<n>We propose two rating methods: Concentric Incremental Rating Circles (CIRC) and Observation to Expectation Rating (OTER)<n>Our analysis reveals models generally perform better in the code summarization tasks as they are not enforced to generate a grammar-based and syntactically correct output.
arXiv Detail & Related papers (2025-11-10T23:44:48Z)
EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation [17.37840331449749]
We propose a self-Evolving Pairwise Reasoning (EvolvR) framework for story evaluation.<n>The framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy.<n>The evaluator trained on the refined data is deployed as a reward model to guide the story generation task.
arXiv Detail & Related papers (2025-08-08T06:10:47Z)
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.<n>We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.<n>LLEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
Sustainability via LLM Right-sizing [21.17523328451591]
Large language models (LLMs) have become increasingly embedded in organizational.<n>This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks.<n>Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint.
arXiv Detail & Related papers (2025-04-17T04:00:40Z)
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models [79.41859481668618]
Large Language Models (LLMs) have significantly advanced the fact-checking studies.<n>Existing automated fact-checking evaluation methods rely on static datasets and classification metrics.<n>We introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities.
arXiv Detail & Related papers (2025-02-25T07:44:22Z)
Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset. We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z)
Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks [20.072783454089098]
This paper presents AutoEval, a novel benchmark for scaling Large Language Model (LLM) assessment in formal tasks with clear notions of correctness.<n>AutoEval is the first benchmarking paradigm that offers several key advantages necessary for scaling objective evaluation of LLMs without human labeling.
arXiv Detail & Related papers (2024-10-11T00:56:37Z)
QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights. We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency. We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.