Related papers: An Expert-grounded benchmark of General Purpose LLMs in LCA

URL: http://arxiv.org/abs/2510.19886v1
Date: Wed, 22 Oct 2025 15:56:54 GMT
Title: An Expert-grounded benchmark of General Purpose LLMs in LCA
Authors: Artur Donaldson, Bharathan Balaji, Cajetan Oriekezie, Manish Kumar, Laure Patouillard
Abstract summary: Large language models (LLMs) are increasingly being explored as tools to support life cycle assessment (LCA). This study provides the first expert-grounded benchmark of LLMs in LCA.
Score: 1.9645069537947935
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Accuracy and quality of explanation were generally rated average or good for many models, even smaller ones, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between ratings of open-weight and closed-weight LLMs, with open-weight models outperforming or competing on par with closed-weight models on criteria such as accuracy and quality of explanation. Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, such as when LLMs are treated as free-form oracles, while also showing benefits, especially in quality of explanation and in alleviating the labour intensiveness of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...

Related papers

Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking [64.97768177044355]
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems. We present FactArena, a fully automated arena-style evaluation framework. Our analyses reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence.
arXiv Detail & Related papers (2026-01-06T02:51:56Z)
Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML? [4.0057196015831495]
Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations.
arXiv Detail & Related papers (2025-10-29T17:05:00Z)
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses [23.308803725940383]
DeCE is a decomposed LLM evaluation framework that separates precision (factual accuracy and relevance) and recall (coverage of required concepts). We instantiate DeCE to evaluate different LLMs on a real-world legal QA task involving multi-jurisdictional reasoning and citation grounding.
arXiv Detail & Related papers (2025-09-19T15:36:02Z)
The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference [13.59675117792588]
Large language models are often assumed to acquire increasingly structured, generalizable internal representations simply by scaling data and parameters. We interrogate this assumption by introducing a Clinical Trial Natural Language Inference benchmark comprising four reasoning families. Each item is paired with a targeted Ground Knowledge and Meta-Level Reasoning Verification probe, allowing us to dissociate failures of factual access from failures of inference.
arXiv Detail & Related papers (2025-08-14T16:01:10Z)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the excellence of LLM-as-a-Judge methods in many domains, their potential issues are under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity [27.10502683001428]
This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. Experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts.
arXiv Detail & Related papers (2024-07-24T09:48:48Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs). Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment [9.156064716689833]
This study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art.
arXiv Detail & Related papers (2024-03-08T00:19:24Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.