Related papers: Human-Calibrated Automated Testing and Validation of Generative Language Models

Human-Calibrated Automated Testing and Validation of Generative Language Models

URL: http://arxiv.org/abs/2411.16391v1
Date: Mon, 25 Nov 2024 13:53:36 GMT
Title: Human-Calibrated Automated Testing and Validation of Generative Language Models
Authors: Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava,
Abstract summary: This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs) It focuses on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking.
Score: 3.2855317710497633
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.

Related papers

AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment [12.9569411072262]
AutoBench is a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs)<n>This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L.
arXiv Detail & Related papers (2025-10-26T09:20:39Z)
Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas [31.16720541398267]
We propose a doubly-robust estimation framework designed to address evaluation sampling bias.<n>Key to our approach is the use of "persona" ratings produced by prompting an evaluator to behave as a human rater.<n>We show that our approach yields valid system quality estimates when either (i) a model trained to predict human ratings using persona ratings and source data observed under sampling bias, or (ii) a reweighting model that corrects for sampling bias is of sufficient quality.
arXiv Detail & Related papers (2025-09-26T21:42:51Z)
CRACQ: A Multi-Dimensional Approach To Automated Document Assessment [0.0]
CRACQ is a multi-dimensional evaluation framework tailored to evaluate documents across f i v e specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality.<n>It integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis.
arXiv Detail & Related papers (2025-09-26T17:01:54Z)
Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework [2.102846336724103]
Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses.<n>This work introduces a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation.
arXiv Detail & Related papers (2025-08-26T11:16:14Z)
Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework [0.0]
We propose a comprehensive framework for evaluating the quality of clinical data extracted by large language models (LLMs)<n>The framework integrates variable-level performance benchmarking against expert human abstraction, automated verification checks for internal consistency and plausibility, and replication analyses.<n>This multidimensional approach enables the identification of variables most in need of improvement, systematic detection of latent errors, and confirmation of dataset fitness-for-purpose in real-world research.
arXiv Detail & Related papers (2025-06-09T20:59:16Z)
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets [0.0]
Retrieval-Augmented Generation (RAG) has advanced significantly in recent years. RAG complexity poses substantial challenges for systematic evaluation and quality enhancement. This study systematically reviews 63 academic articles to provide a comprehensive overview of state-of-the-art RAG evaluation methodologies.
arXiv Detail & Related papers (2025-04-28T08:22:19Z)
SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection. Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains. We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z)
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z)
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Models evaluator preferences with human evaluations. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation [48.54271457765236]
Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values. Current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values. We propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments.
arXiv Detail & Related papers (2024-05-23T02:57:42Z)
CheckEval: Robust Evaluation Framework using Large Language Model via Checklist [6.713203569074019]
We introduce CheckEval, a novel evaluation framework using Large Language Models. CheckEval addresses the challenges of ambiguity and inconsistency in current evaluation methods.
arXiv Detail & Related papers (2024-03-27T17:20:39Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench. GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.