LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text
- URL: http://arxiv.org/abs/2601.11746v1
- Date: Fri, 16 Jan 2026 19:55:06 GMT
- Title: LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text
- Authors: George Mihaila, Suleyman Olcay Polat, Poli Nemkova, Himanshu Sharma, Namratha V. Urs, Mark V. Albert
- Abstract summary: We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box explainability.
- Score: 7.194073942393882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Local explanation methods such as LIME (Ribeiro et al., 2016) remain fundamental to trustworthy AI, yet their application to NLP is limited by a reliance on random token masking. These heuristic perturbations frequently generate semantically invalid, out-of-distribution inputs that weaken the fidelity of local surrogate models. While recent generative approaches such as LLiMe (Angiulli et al., 2025b) attempt to mitigate this by employing Large Language Models for neighborhood generation, they rely on unconstrained paraphrasing that introduces confounding variables, making it difficult to isolate specific feature contributions. We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. By enforcing a strict "Single Mask-Single Sample" protocol and employing distinct neutral infill and boundary infill strategies, LIME-LLM constructs fluent, on-manifold neighborhoods that rigorously isolate feature effects. We evaluate our method against established baselines (LIME, SHAP, Integrated Gradients) and the generative LLiMe baseline across three diverse benchmarks (CoLA, SST-2, and HateXplain), using human-annotated rationales as ground truth. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box NLP explainability, achieving significant improvements in local explanation fidelity compared to both traditional perturbation-based methods and recent generative alternatives.
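As a concrete illustration of the protocol described in the abstract, here is a minimal Python sketch of the "Single Mask-Single Sample" loop with a neutral-infill step. The `neutral_infill` and `black_box_predict` functions are hypothetical stand-ins, and the boundary-infill variant and prompting details are not shown; treat this as an assumption-laden sketch, not the released implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

def neutral_infill(tokens, i):
    """Hypothetical: ask an LLM to replace tokens[i] with a fluent,
    semantically neutral alternative and return the full counterfactual."""
    raise NotImplementedError

def black_box_predict(text):
    """Hypothetical: positive-class probability from the model under study."""
    raise NotImplementedError

def lime_llm_explain(text):
    tokens = text.split()
    base = black_box_predict(text)
    X, y = [], []
    for i in range(len(tokens)):               # one mask -> one sample
        counterfactual = neutral_infill(tokens, i)
        x = np.ones(len(tokens))
        x[i] = 0.0                             # feature i perturbed
        X.append(x)
        y.append(black_box_predict(counterfactual) - base)
    # Local linear surrogate: coefficients approximate per-token effects.
    surrogate = Ridge(alpha=1.0).fit(np.array(X), np.array(y))
    return dict(zip(tokens, surrogate.coef_))
```

Because every counterfactual perturbs exactly one token and stays fluent, the surrogate's coefficients isolate per-token effects rather than absorbing the confounds of unconstrained paraphrasing.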
Related papers
- Improving Local Fidelity Through Sampling and Modeling Nonlinearity [3.7080015862513847]
Local Interpretable Model-agnostic Explanation (LIME) assumes that the local decision boundary is linear and fails to capture non-linear relationships. We propose a novel method that can generate high-fidelity explanations; a toy illustration of the linearity limitation appears after this entry.
arXiv Detail & Related papers (2025-12-05T09:26:18Z)
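To make the linearity limitation concrete, here is a toy sketch (not the paper's method) comparing a linear and a shallow nonlinear surrogate fitted to the same local neighborhood of a synthetic black box; the nonlinear surrogate's higher local R² illustrates why modeling nonlinearity can raise fidelity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
black_box = lambda X: np.sin(3 * X[:, 0]) * X[:, 1]  # toy nonlinear model

x0 = np.array([0.5, 0.5])                      # instance being explained
X = x0 + 0.3 * rng.standard_normal((500, 2))   # local neighborhood
y = black_box(X)

for surrogate in (Ridge(), DecisionTreeRegressor(max_depth=3)):
    r2 = surrogate.fit(X, y).score(X, y)       # local fidelity proxy
    print(type(surrogate).__name__, round(r2, 3))
```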
- Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification [9.397157329808254]
MUSE is a simple information-theoretic method to identify and aggregate well-calibrated subsets of large language models. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines; a hedged sketch of the subset-then-aggregate idea follows this entry.
arXiv Detail & Related papers (2025-07-09T19:13:25Z)
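A hedged sketch of the subset-then-aggregate idea: the Brier score below is a simple stand-in for the paper's information-theoretic calibration criterion, and `muse_like_ensemble` is an illustrative name, not MUSE's API.

```python
import numpy as np

def brier(p, y):                          # lower means better calibrated
    return float(np.mean((p - y) ** 2))

def muse_like_ensemble(dev_probs, dev_labels, test_probs, k=2):
    """dev_probs/test_probs: dict model_name -> array of P(y=1)."""
    scores = {m: brier(p, dev_labels) for m, p in dev_probs.items()}
    keep = sorted(scores, key=scores.get)[:k]  # k best-calibrated models
    return np.mean([test_probs[m] for m in keep], axis=0), keep

dev_labels = np.array([1, 0, 1, 1, 0])
dev_probs = {"a": np.array([.9, .2, .8, .7, .1]),
             "b": np.array([.6, .5, .6, .5, .5]),
             "c": np.array([.95, .1, .9, .8, .2])}
print(muse_like_ensemble(dev_probs, dev_labels, dev_probs))
```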
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines; a simplified sketch of the sample-then-check idea follows this entry.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
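A simplified sketch of the sample-then-check trick: the constraint is evaluated only on tokens actually drawn, not on the whole vocabulary. The paper additionally corrects the induced bias with adaptive importance weights, which this toy version omits.

```python
import random

def constrained_sample(probs, allowed):
    """probs: dict token -> probability; allowed: predicate(token) -> bool."""
    probs = dict(probs)
    while probs:
        tokens, weights = zip(*probs.items())
        tok = random.choices(tokens, weights=weights)[0]
        if allowed(tok):             # constraint checked lazily, per draw
            return tok
        del probs[tok]               # reject; never redraw this token
    raise ValueError("constraint excludes every token")

print(constrained_sample({"cat": .5, "dog": .3, "42": .2}, str.isalpha))
```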
- Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation [0.0]
We explore uncertainty estimation as a proxy for correctness in LLM-generated code. We adapt two state-of-the-art techniques from natural language generation to the domain of code generation. Our findings indicate a strong correlation between the uncertainty computed through these techniques and correctness; one possible adaptation is sketched after this entry.
arXiv Detail & Related papers (2025-02-17T10:03:01Z)
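One possible adaptation, sketched under assumptions: confidence is estimated as the agreement of several sampled programs on probe inputs. The paper adapts specific NLG uncertainty techniques, which may differ from this behavioural-agreement stand-in.

```python
from collections import Counter

def behavioural_confidence(candidates, probe_inputs):
    """candidates: list of callables (sampled candidate programs)."""
    signatures = []
    for f in candidates:
        try:
            signatures.append(tuple(f(x) for x in probe_inputs))
        except Exception:
            signatures.append(("<error>",))
    agree = Counter(signatures).most_common(1)[0][1]
    return agree / len(candidates)   # share of candidates that agree

cands = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
print(behavioural_confidence(cands, probe_inputs=[1, 2, 3]))  # 2/3 agree
```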
- Boosting Explainability through Selective Rationalization in Pre-trained Language Models [16.409817098221012]
The widespread application of pre-trained language models (PLMs) in natural language processing (NLP) has led to increasing concerns about their explainability. Applying existing rationalization frameworks to PLMs results in severe degeneration and failure problems, producing sub-optimal or meaningless rationales. We propose a method named Pre-trained Language Model's Rationalization (PLMR), which splits PLMs into a generator and a predictor to handle NLP tasks while providing interpretable rationales; the generic select-then-predict split is sketched after this entry.
arXiv Detail & Related papers (2025-01-03T07:52:40Z)
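A generic select-then-predict sketch of the generator/predictor split; `score_tokens` and `classify` are hypothetical stand-ins for the two PLM heads, and PLMR's actual training objectives are not shown.

```python
def score_tokens(tokens):
    """Hypothetical generator head: one relevance score per token."""
    raise NotImplementedError

def classify(rationale_text):
    """Hypothetical predictor head: label from the rationale alone."""
    raise NotImplementedError

def rationalize(text, keep_ratio=0.2):
    tokens = text.split()
    scores = score_tokens(tokens)
    k = max(1, int(keep_ratio * len(tokens)))
    keep = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    rationale = [tokens[i] for i in sorted(keep)]  # preserve word order
    return classify(" ".join(rationale)), rationale
```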
- Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models [104.55763564037831]
We train a regression model that leverages attention maps, probabilities at the current generation step, and recurrently computed uncertainty scores from previously generated tokens. Our evaluation shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rival unsupervised and supervised approaches; a minimal feature-and-regressor sketch follows this entry.
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
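A minimal sketch of the supervised recipe: pool cheap per-step signals into features and train a small regressor to predict uncertainty. The feature set here is illustrative, not the paper's exact one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def step_features(token_probs, attn_row, prev_score):
    p = np.asarray(token_probs)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return [p.max(), entropy, np.asarray(attn_row).max(), prev_score]

# Train on (features, quality label) pairs collected offline.
X = np.array([step_features([.7, .2, .1], [.5, .3, .2], 0.1),
              step_features([.4, .3, .3], [.4, .4, .2], 0.6)])
y = np.array([0.2, 0.7])                 # e.g., 1 - correctness
model = LinearRegression().fit(X, y)
print(model.predict(X))
```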
- DALD: Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), a framework that sets a new state of the art in black-box text detection.
DALD aligns the surrogate model's distribution with that of unknown target LLMs, yielding stronger detection and resilience against rapid model iterations; a hedged scoring sketch follows this entry.
arXiv Detail & Related papers (2024-06-07T19:38:05Z)
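A hedged sketch of logits-based detection with an aligned surrogate: texts are scored by their average log-probability under a surrogate that has been fine-tuned on samples of the target LLM's outputs. `surrogate_logprobs` is a hypothetical stand-in, and DALD's actual alignment and scoring details may differ.

```python
import numpy as np

def surrogate_logprobs(text):
    """Hypothetical: per-token log-probs of `text` under a small surrogate
    fine-tuned on a corpus of the target LLM's outputs (alignment step)."""
    raise NotImplementedError

def machine_text_score(text):
    lp = np.asarray(surrogate_logprobs(text))
    return float(lp.mean())   # higher mean log-prob -> more machine-like

# Flag as machine-generated when the score exceeds a threshold
# chosen on validation data.
```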
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses remains a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer; an assumption-laden sketch follows this entry.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
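A self-consistency-style reading of the summary, sketched under assumptions: sample several explanation-answer chains and use the entropy of the final answers as the uncertainty signal. `sample_explained_answer` is a hypothetical LLM call; the paper's stability measure over explanations may be more involved.

```python
import math
from collections import Counter

def sample_explained_answer(question):
    """Hypothetical: return one sampled (explanation, answer) pair."""
    raise NotImplementedError

def answer_entropy(question, k=10):
    answers = [sample_explained_answer(question)[1] for _ in range(k)]
    counts = Counter(answers)
    return -sum((c / k) * math.log(c / k) for c in counts.values())
```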
- Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant quality improvements on complex NLP tasks.
Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs.
We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies; a minimal deferral sketch follows this entry.
arXiv Detail & Related papers (2024-04-15T21:02:48Z)
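A minimal cascade-with-deferral sketch: the small model answers first, its token-level probabilities are pooled into features, and a trained classifier decides whether to escalate. Both model calls are hypothetical stand-ins, and the pooled features are illustrative.

```python
import numpy as np

def small_model(prompt):
    """Hypothetical: (answer, probabilities of the chosen tokens)."""
    raise NotImplementedError

def large_model(prompt):
    """Hypothetical: the expensive model's answer."""
    raise NotImplementedError

def cascade(prompt, deferral_clf):
    answer, tok_probs = small_model(prompt)
    p = np.asarray(tok_probs)
    feats = [[p.mean(), p.min(), np.quantile(p, 0.1)]]  # pooled uncertainty
    if deferral_clf.predict(feats)[0]:   # learned post-hoc deferral rule
        return large_model(prompt)
    return answer
```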
- GLIME: General, Stable and Local LIME Explanation [11.002828804775392]
Local Interpretable Model-agnostic Explanations (LIME) is a widely adopted method for understanding model behaviors.
We introduce GLIME, an enhanced framework extending LIME and unifying several prior methods.
By employing a local and unbiased sampling distribution, GLIME generates explanations with higher local fidelity than LIME; a toy illustration of instance-centred sampling follows this entry.
arXiv Detail & Related papers (2023-11-27T11:17:20Z)
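An illustrative toy of the sampling change GLIME argues for: draw the neighborhood from a distribution centered on the instance itself (local and unbiased) before fitting the usual linear surrogate. A continuous Gaussian neighborhood is used here for simplicity; GLIME's actual sampling distributions differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
black_box = lambda X: (X[:, 0] - X[:, 1] > 0).astype(float)  # toy model

x0 = np.array([0.2, -0.1])                      # instance to explain
Z = x0 + 0.1 * rng.standard_normal((1000, 2))   # instance-centred samples
weights = LinearRegression().fit(Z, black_box(Z)).coef_
print(weights)   # local feature attributions around x0
```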
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions.
With a reasonable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list; a toy prompt-construction sketch follows this entry.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
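A toy sketch of N-best error correction with an LLM: pack the hypotheses into one prompt and ask for the corrected transcript, which lets the model recover tokens absent from every hypothesis. `llm` is a hypothetical completion call, not the benchmark's released code.

```python
def build_prompt(nbest):
    lines = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return ("Below are ASR N-best hypotheses for one utterance:\n"
            f"{lines}\n"
            "Write the single most likely true transcript.")

def llm(prompt):
    """Hypothetical: return the LLM's completion for `prompt`."""
    raise NotImplementedError

nbest = ["i scream for ice cream", "eye scream for ice cream"]
print(build_prompt(nbest))   # corrected = llm(build_prompt(nbest))
```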