Evaluating Explanations Through LLMs: Beyond Traditional User Studies
- URL: http://arxiv.org/abs/2410.17781v1
- Date: Wed, 23 Oct 2024 11:31:52 GMT
- Title: Evaluating Explanations Through LLMs: Beyond Traditional User Studies
- Authors: Francesco Bombassei De Bona, Gabriele Dominici, Tim Miller, Marc Langheinrich, Martin Gjoreski
- Abstract summary: We reproduce a user study comparing counterfactual and causal explanations, replicating human participants with seven Large Language Models (LLMs).
Our results show that (i) LLMs can replicate most conclusions from the original study, (ii) different LLMs yield varying levels of alignment in the results, and (iii) experimental factors such as LLM memory and output variability affect alignment with human responses.
- Score: 7.377398767507683
- Abstract: As AI becomes fundamental in sectors like healthcare, explainable AI (XAI) tools are essential for trust and transparency. However, traditional user studies used to evaluate these tools are often costly, time-consuming, and difficult to scale. In this paper, we explore the use of Large Language Models (LLMs) to replicate human participants to help streamline XAI evaluation. We reproduce a user study comparing counterfactual and causal explanations, replicating human participants with seven LLMs under various settings. Our results show that (i) LLMs can replicate most conclusions from the original study, (ii) different LLMs yield varying levels of alignment in the results, and (iii) experimental factors such as LLM memory and output variability affect alignment with human responses. These initial findings suggest that LLMs could provide a scalable and cost-effective way to simplify qualitative XAI evaluation.
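To make the setup above concrete, the following is a minimal sketch, assuming the OpenAI Python client, of how a single questionnaire item might be posed to an LLM standing in for a study participant. It is not the authors' code: the model name, the 1-7 rating scale, and the prompt wording are illustrative assumptions, and the temperature knob is just one plausible way to vary the output variability mentioned in the abstract.

```python
# Minimal sketch (not the authors' code): asking an LLM to act as a user-study
# participant and rate an explanation. Model name, scale, and prompts are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PARTICIPANT_ROLE = (
    "You are a participant in a user study. Read the scenario and the "
    "explanation, then answer the questionnaire item honestly."
)

def ask_participant(scenario: str, explanation: str, temperature: float = 1.0) -> str:
    """Ask the simulated participant to rate one explanation on a 1-7 scale."""
    prompt = (
        f"Scenario: {scenario}\n"
        f"Explanation given by the AI system: {explanation}\n"
        "On a scale from 1 (not at all) to 7 (completely), how much does this "
        "explanation help you understand the AI's decision? Reply with a single number."
    )
    # "Memory" could plausibly be simulated by carrying earlier questionnaire
    # turns in this message list; here each item is asked in isolation.
    response = client.chat.completions.create(
        model="gpt-4o-mini",       # arbitrary stand-in; the study compares seven LLMs
        temperature=temperature,   # output variability is one of the studied factors
        messages=[
            {"role": "system", "content": PARTICIPANT_ROLE},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

# Example: collect ratings for a counterfactual and a causal explanation.
scenario = "An AI system reviewed your loan application and denied it."
counterfactual = "The loan would have been approved if your income were $5,000 higher."
causal = "The loan was denied because your income is below the learned threshold."
print(ask_participant(scenario, counterfactual))
print(ask_participant(scenario, causal))
```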
Related papers
- Enhancing LLM Evaluations: The Garbling Trick [0.0]
As large language models (LLMs) become increasingly powerful, it is becoming harder to distinguish between models on the basis of their performance.
We propose a general method to transform existing LLM evaluations into a series of progressively more difficult tasks.
Our results offer insights into the comparative reasoning abilities of these models, particularly highlighting distinctions between OpenAI's o1-preview and Google's gemini-pro-1.5.
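The summary above does not spell out the transformation itself, so the sketch below is only a hypothetical illustration of the general idea: take an existing evaluation item and produce progressively harder variants by corrupting a growing fraction of its text. The `garble` function and its character-replacement rule are invented for this sketch and may differ from the paper's actual procedure.

```python
# Hypothetical illustration of a difficulty dial applied to existing benchmark
# text; not the paper's actual transformation.
import random
import string

def garble(text: str, rate: float, seed: int = 0) -> str:
    """Replace roughly `rate` of the alphabetic characters with random letters."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < rate:
            chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)

question = "Which planet in the Solar System has the largest mass?"
for rate in (0.0, 0.1, 0.2, 0.4):
    print(f"rate={rate:.1f}: {garble(question, rate)}")
```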
arXiv Detail & Related papers (2024-11-03T11:39:50Z)
- The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead? [60.01746782465275]
Large Language Models (LLMs) have shown capabilities close to human performance in various analytical tasks.
This paper investigates the efficiency and accuracy of LLMs in specialized tasks through a structured user study focusing on Human-LLM partnership.
arXiv Detail & Related papers (2024-10-07T02:30:18Z)
- CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators as they exhibit familiarity bias and show skewed distributions of ratings.
We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that humans would consider insignificant for judging text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z)
- Explaining Large Language Models Decisions Using Shapley Values [1.223779595809275]
Large language models (LLMs) have opened up exciting possibilities for simulating human behavior and cognitive processes.
However, the validity of utilizing LLMs as stand-ins for human subjects remains uncertain.
This paper presents a novel approach based on Shapley values to interpret LLM behavior and quantify the relative contribution of each prompt component to the model's output.
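As a rough illustration of the idea, the sketch below estimates Shapley values for prompt components with a standard permutation-sampling estimator. The `score_prompt` callable is a hypothetical placeholder for "query the LLM built from these components and measure the quantity of interest"; it and the toy example are not taken from the paper.

```python
# Permutation-sampling Shapley estimate of each prompt component's average
# marginal contribution to a model score. `score_prompt` is a placeholder.
import random
from typing import Callable, Dict, List

def shapley_prompt_attribution(
    components: List[str],
    score_prompt: Callable[[List[str]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    rng = random.Random(seed)
    contrib = {c: 0.0 for c in components}
    for _ in range(n_permutations):
        order = components[:]
        rng.shuffle(order)
        included: List[str] = []
        prev = score_prompt(included)
        for comp in order:
            included.append(comp)
            cur = score_prompt(included)
            contrib[comp] += cur - prev  # marginal contribution of `comp`
            prev = cur
    return {c: v / n_permutations for c, v in contrib.items()}

# Toy scoring function standing in for an actual LLM call: the score is the
# number of "useful" components present, so each should get roughly equal credit.
useful = {"task instruction", "persona", "few-shot example"}
score = lambda parts: float(len(useful.intersection(parts)))
print(shapley_prompt_attribution(
    ["task instruction", "persona", "few-shot example", "filler"], score))
```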
arXiv Detail & Related papers (2024-03-29T22:49:43Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training type, and architecture profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
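For readers unfamiliar with these procedures, here is a minimal sketch of a one-way ANOVA followed by Tukey's HSD on per-model benchmark scores. The scores are made-up numbers and the snippet is illustrative only, not the paper's analysis code.

```python
# One-way ANOVA plus Tukey HSD over (fabricated) per-model benchmark scores.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = {
    "model_a": [71.2, 69.8, 72.5, 70.9],
    "model_b": [74.1, 73.6, 75.0, 74.4],
    "model_c": [70.5, 71.1, 69.9, 70.2],
}

# ANOVA: is there any difference in mean score across models?
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4f}")

# Tukey HSD: which pairs of models differ, controlling the family-wise error rate?
values = np.concatenate(list(scores.values()))
groups = np.concatenate([[name] * len(v) for name, v in scores.items()])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```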
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
- LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks.
Recent efforts have demonstrated that LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z)
- On Learning to Summarize with Large Language Models as References [101.79795027550959]
Summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets.
We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
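As a small illustration of what "consistent with expert human evaluation" can mean operationally, the sketch below computes rank agreement between hypothetical human and LLM ratings of the same texts. The numbers are made up, and the choice of Spearman's rho is an assumption for this sketch, not necessarily the paper's metric.

```python
# Rank agreement between (fabricated) expert and LLM ratings of the same texts.
from scipy.stats import spearmanr

human_scores = [4, 5, 2, 3, 5, 1, 4, 2, 3, 5]  # expert ratings on a 1-5 scale
llm_scores   = [4, 4, 2, 3, 5, 2, 4, 1, 3, 5]  # LLM ratings of the same texts

rho, p = spearmanr(human_scores, llm_scores)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```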
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.