Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains
- URL: http://arxiv.org/abs/2512.05700v1
- Date: Fri, 05 Dec 2025 13:28:29 GMT
- Title: Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains
- Authors: Ben Malin, Tatiana Kalganova, Nikolaos Boulgouris
- Abstract summary: We present a methodology for improving the accuracy of faithfulness evaluation in Large Language Models (LLMs). The proposed methodology is based on the combination of elementary faithfulness metrics into a combined (fused) metric. This fused metric is demonstrated to correlate more strongly with human judgements across all tested domains for faithfulness.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a methodology for improving the accuracy of faithfulness evaluation in Large Language Models (LLMs). The proposed methodology is based on the combination of elementary faithfulness metrics into a combined (fused) metric, for the purpose of improving the evaluation of the faithfulness of LLM outputs. The proposed strategy for metric fusion deploys a tree-based model to identify the importance of each metric, driven by the integration of human judgements evaluating the faithfulness of LLM responses. This fused metric is demonstrated to correlate more strongly with human judgements across all tested domains. Improving the ability to evaluate the faithfulness of LLMs allows greater confidence to be placed in models, enabling their deployment in a wider range of scenarios. Additionally, we homogenise a collection of datasets across question-answering and dialogue-based domains and incorporate human judgements and LLM responses within this dataset, allowing faithfulness evaluation to be reproduced and trialled across domains.
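The fusion strategy described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the elementary metric names, the synthetic data, and the choice of scikit-learn's GradientBoostingRegressor as the tree-based model are all assumptions, since the abstract does not specify them.

```python
# Hypothetical sketch of tree-based faithfulness-metric fusion.
# Metric names ("nli", "qa", "overlap"), data, and model choice are assumed.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for three elementary faithfulness metrics,
# each scored on n LLM responses.
n = 500
metrics = rng.uniform(0.0, 1.0, size=(n, 3))

# Synthetic human faithfulness judgements: a weighted mix of the
# elementary metrics plus annotation noise.
human = 0.6 * metrics[:, 0] + 0.3 * metrics[:, 1] + 0.1 * metrics[:, 2]
human += rng.normal(0.0, 0.05, size=n)

# Fit the tree-based fuser: it learns how much each metric matters
# for predicting the human judgements.
fuser = GradientBoostingRegressor(random_state=0).fit(metrics, human)
fused = fuser.predict(metrics)

# Per-metric importance as learned by the trees.
print(dict(zip(["nli", "qa", "overlap"], fuser.feature_importances_.round(3))))

# The fused score should track human judgements more closely than
# any single elementary metric does.
rho_fused, _ = spearmanr(fused, human)
rho_best_single = max(spearmanr(metrics[:, i], human)[0] for i in range(3))
print(f"fused rho={rho_fused:.3f}, best single rho={rho_best_single:.3f}")
```

In a real setup the features would be actual metric scores (e.g. NLI-based or QA-based faithfulness scores) and the targets would be the collected human annotations, with the correlation measured on held-out data rather than the training set.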
Related papers
- On Evaluating LLM Alignment by Evaluating LLMs as Judges [68.15541137648721]
Evaluating large language models' (LLMs) alignment requires them to be helpful, honest, safe, and to precisely follow human instructions. We examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. We propose a benchmark that assesses alignment without directly evaluating model outputs.
arXiv Detail & Related papers (2025-11-25T18:33:24Z)
- Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning [49.559151128219725]
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness. We propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets.
arXiv Detail & Related papers (2025-11-13T08:13:23Z)
- Skewed Score: A statistical framework to assess autograders [2.9645858732618238]
"LLM-as-a-judge", or autograders, offer a scalable alternative to human evaluation. They have shown mixed reliability and may exhibit systematic biases. We propose a statistical framework that enables researchers to simultaneously assess their autograders.
arXiv Detail & Related papers (2025-07-04T18:45:10Z)
- Reranking-based Generation for Unbiased Perspective Summarization [10.71668103641552]
We develop a test set for benchmarking metric reliability using human annotations. We show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
arXiv Detail & Related papers (2025-06-19T00:01:43Z)
- IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
- CER: Confidence Enhanced Reasoning in LLMs [2.4392539322920763]
We introduce an uncertainty-aware framework designed to enhance the accuracy of Large Language Model responses. We quantify the confidence of intermediate answers, such as numerical results in mathematical reasoning and proper nouns in open-domain generation. Results consistently validate the effectiveness of our novel confidence aggregation method.
arXiv Detail & Related papers (2025-02-20T15:16:42Z)
- Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. We propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z)
- FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom [19.104850413126066]
Federated Learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs).
Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers.
We propose FedEval-LLM that provides reliable performance measurements of LLMs on downstream tasks without the reliance on labeled test sets and external tools.
arXiv Detail & Related papers (2024-04-18T15:46:26Z)
- FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models [36.273451767886726]
FreeEval is a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of large language models.
FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies.
The framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules, enhance the fairness of the evaluation outcomes.
arXiv Detail & Related papers (2024-04-09T04:17:51Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- PiCO: Peer Review in LLMs based on the Consistency Optimization [48.48819141999387]
We use peer-review mechanisms to measure large language models (LLMs) automatically. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. We propose three metrics, called PEN, CIN, and LIS, to evaluate the gap in aligning with human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z)
- Fusion-Eval: Integrating Assistant Evaluators with LLMs [11.309661922644217]
'Fusion-Eval' is an innovative approach that leverages Large Language Models (LLMs) to integrate insights from various assistant evaluators.
We show that Fusion-Eval achieves a 0.962 system-level Kendall-Tau correlation with humans on SummEval and a 0.744 turn-level Spearman correlation on TopicalChat.
arXiv Detail & Related papers (2023-11-15T18:46:56Z)
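Correlations with human judgements of the kind reported above (e.g. Fusion-Eval's system-level Kendall-Tau and turn-level Spearman) can be computed directly with scipy; the per-system scores below are made-up placeholders, not values from any of these papers.

```python
# Sketch of meta-evaluation via rank correlation.
# The auto/human scores are hypothetical placeholders.
from scipy.stats import kendalltau, spearmanr

# Hypothetical average scores for six systems, from an automatic
# evaluator and from human annotators.
auto = [0.71, 0.64, 0.80, 0.55, 0.62, 0.77]
human = [0.68, 0.60, 0.82, 0.50, 0.65, 0.79]

tau, _ = kendalltau(auto, human)   # rank agreement (system level)
rho, _ = spearmanr(auto, human)    # monotonic correlation
print(f"Kendall tau={tau:.3f}, Spearman rho={rho:.3f}")
```

System-level correlation ranks whole systems by their mean scores, whereas turn-level (or instance-level) correlation compares scores on individual examples; the same two scipy calls apply to both, only the inputs differ.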
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.