RobustExplain: Evaluating Robustness of LLM-Based Explanation Agents for Recommendation
- URL: http://arxiv.org/abs/2601.19120v3
- Date: Tue, 03 Feb 2026 01:12:22 GMT
- Title: RobustExplain: Evaluating Robustness of LLM-Based Explanation Agents for Recommendation
- Authors: Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu
- Abstract summary: Large Language Models (LLMs) are increasingly used to generate natural-language explanations in recommender systems. In real-world web platforms, interaction histories are inherently noisy due to accidental clicks, temporal inconsistencies, missing values, and evolving preferences. We present RobustExplain, the first systematic evaluation framework for measuring the robustness of LLM-generated recommendation explanations.
- Score: 8.70920344844399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used to generate natural-language explanations in recommender systems, acting as explanation agents that reason over user behavior histories. While prior work has focused on explanation fluency and relevance under fixed inputs, the robustness of LLM-generated explanations to realistic user behavior noise remains largely unexplored. In real-world web platforms, interaction histories are inherently noisy due to accidental clicks, temporal inconsistencies, missing values, and evolving preferences, raising concerns about explanation stability and user trust. We present RobustExplain, the first systematic evaluation framework for measuring the robustness of LLM-generated recommendation explanations. RobustExplain introduces five realistic user behavior perturbations evaluated across multiple severity levels and a multi-dimensional robustness metric capturing semantic, keyword, structural, and length consistency. Our goal is to establish a principled, task-level evaluation framework and initial robustness baselines, rather than to provide a comprehensive leaderboard across all available LLMs. Experiments on four representative LLMs (7B--70B) show that current models exhibit only moderate robustness, with larger models achieving up to 8% higher stability. Our results establish the first robustness benchmarks for explanation agents and highlight robustness as a critical dimension for trustworthy, agent-driven recommender systems at web scale.
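The abstract's four consistency dimensions (semantic, keyword, structural, and length) can be illustrated with a minimal sketch. The function names and the stdlib proxies below are assumptions for illustration only: `difflib` character similarity stands in for an embedding-based semantic score, and Jaccard word overlap stands in for keyword consistency. The paper's actual metric definitions are not given here.

```python
# Hypothetical sketch of a multi-dimensional robustness score comparing an
# explanation generated from a clean interaction history against one generated
# from a perturbed history. All proxies are illustrative, not the paper's
# implementation.
from difflib import SequenceMatcher
import re


def semantic_consistency(a: str, b: str) -> float:
    # Character-level similarity as a stand-in for embedding-based similarity.
    return SequenceMatcher(None, a, b).ratio()


def keyword_consistency(a: str, b: str) -> float:
    # Jaccard overlap of lowercase word sets.
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0


def structural_consistency(a: str, b: str) -> float:
    # Ratio of sentence counts as a crude structural proxy.
    sa = len(re.split(r"[.!?]+", a.strip()))
    sb = len(re.split(r"[.!?]+", b.strip()))
    return min(sa, sb) / max(sa, sb)


def length_consistency(a: str, b: str) -> float:
    # Ratio of character lengths.
    return min(len(a), len(b)) / max(len(a), len(b)) if a and b else 0.0


def robustness_score(clean: str, perturbed: str) -> float:
    # Unweighted mean over the four dimensions; a real metric would likely
    # weight dimensions and aggregate over many perturbation samples.
    dims = (semantic_consistency, keyword_consistency,
            structural_consistency, length_consistency)
    return sum(f(clean, perturbed) for f in dims) / len(dims)
```

A model would be scored by averaging `robustness_score` over many (clean, perturbed) explanation pairs at each perturbation severity level; identical explanations score 1.0 and the score decreases as the perturbed explanation drifts.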
Related papers
- Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities [2.9203730377983654]
Existing benchmarks fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC, a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints. We show that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position.
arXiv Detail & Related papers (2026-01-26T15:02:15Z) - LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals [18.015918696398085]
Concept-based explanations quantify how high-level concepts influence model behavior. Existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. We introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy.
arXiv Detail & Related papers (2026-01-15T18:54:50Z) - ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning [2.1461777157838724]
We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in large language model (LLM) reasoning. Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. We further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability.
arXiv Detail & Related papers (2025-12-08T18:26:58Z) - Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models [57.33350664910483]
We introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios.
arXiv Detail & Related papers (2025-11-12T06:06:29Z) - Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection [71.8243083897721]
Vision-language models often hallucinate details, generating non-existent objects or inaccurate attributes that compromise output reliability. We present a novel framework that leverages the model's self-consistency between long responses and short answers to generate preference pairs for training.
arXiv Detail & Related papers (2025-09-27T10:37:11Z) - STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. We develop anchored reinforcement training, a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. Experiments on the MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z) - When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z) - Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions [23.024212585005714]
Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural language. This paper aims to provide comprehensive terminology for the concepts and methods in this field and to facilitate the community's research.
arXiv Detail & Related papers (2025-06-08T16:20:12Z) - ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments [23.514446188485838]
We argue for a method of moments evaluation over the space of meaning-preserving prompt perturbations. We show that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity.
arXiv Detail & Related papers (2025-05-28T09:40:48Z) - Statistical Runtime Verification for LLMs via Robustness Estimation [0.0]
Adversarial robustness verification is essential for ensuring the safe deployment of Large Language Models (LLMs) in runtime-critical applications. This paper presents a case study adapting and extending the RoMA statistical verification framework to assess its feasibility as an online runtime robustness monitor for LLMs in black-box deployment settings.
arXiv Detail & Related papers (2025-04-24T16:36:19Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors.
Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal language models' comprehensive grasp of language and their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.