Related papers: A Unified Framework with Novel Metrics for Evaluating the Effectiveness of XAI Techniques in LLMs

URL: http://arxiv.org/abs/2503.05050v2
Date: Mon, 07 Apr 2025 20:37:11 GMT
Title: A Unified Framework with Novel Metrics for Evaluating the Effectiveness of XAI Techniques in LLMs
Authors: Melkamu Abay Mersha, Mesay Gemeda Yigezu, Hassan Shakil, Ali K. AlShami, Sanghyun Byun, Jugal Kalita
Abstract summary: This study introduces a comprehensive evaluation framework with four novel metrics for assessing the effectiveness of five XAI techniques. The evaluation focuses on four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity.
Score: 5.112826806339356
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The increasing complexity of LLMs presents significant challenges to their transparency and interpretability, necessitating the use of eXplainable AI (XAI) techniques to enhance trustworthiness and usability. This study introduces a comprehensive evaluation framework with four novel metrics for assessing the effectiveness of five XAI techniques across five LLMs and two downstream tasks. We apply this framework to evaluate several XAI techniques, namely LIME, SHAP, Integrated Gradients, Layer-wise Relevance Propagation (LRP), and Attention Mechanism Visualization (AMV), using the IMDB Movie Reviews and Tweet Sentiment Extraction datasets. The evaluation focuses on four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. Our results show that LIME consistently achieves high scores across multiple LLMs and evaluation metrics, while AMV demonstrates superior Robustness and near-perfect Consistency. LRP excels in Contrastivity, particularly with more complex models. Our findings provide valuable insights into the strengths and limitations of different XAI methods, offering guidance for developing and selecting appropriate XAI techniques for LLMs.

Related papers

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
Evaluating the Effectiveness of XAI Techniques for Encoder-Based Language Models [6.349503549199403]
This study presents a general evaluation framework using four key metrics: Human-reasoning Agreement (HA), Robustness, Consistency, and Contrastivity. We assess the effectiveness of six explainability techniques from five different XAI categories. Our findings show that the model simplification-based XAI method (LIME) consistently outperforms across multiple metrics and models.
arXiv Detail & Related papers (2025-01-26T03:08:34Z)
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models [12.112914393948415]
We present RUPBench, a benchmark designed to evaluate large language models (LLMs) across diverse reasoning tasks. Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning. By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns.
arXiv Detail & Related papers (2024-06-16T17:26:44Z)
Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs [1.0878040851638]
This paper surveys evaluation techniques to enhance the trustworthiness and understanding of Large Language Models (LLMs). Key evaluation metrics include Perplexity Measurement, NLP metrics (BLEU, ROUGE, METEOR, BERTScore, GLEU, Word Error Rate, Character Error Rate), Zero-Shot and Few-Shot Learning Performance, Transfer Learning Evaluation, Adversarial Testing, and Fairness and Bias Evaluation.
arXiv Detail & Related papers (2024-06-04T03:54:53Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training types, and architectures profoundly impact the performance of LLMs. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era [77.174117675196]
XAI is being extended towards Large Language Models (LLMs). This paper analyzes how XAI can benefit LLMs and AI systems. We introduce 10 strategies, presenting the key techniques for each and discussing their associated challenges.
arXiv Detail & Related papers (2024-03-13T20:25:27Z)
METAL: Metamorphic Testing Framework for Analyzing Large-Language Model Qualities [4.493507573183107]
Large-Language Models (LLMs) have shifted the paradigm of natural language data processing. Recent studies have tested Quality Attributes (QAs) of LLMs by generating adversarial input texts. We propose a MEtamorphic Testing for Analyzing LLMs (METAL) framework to address these issues.
arXiv Detail & Related papers (2023-12-11T01:29:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.