On the Factual Consistency of Text-based Explainable Recommendation Models
- URL: http://arxiv.org/abs/2512.24366v1
- Date: Tue, 30 Dec 2025 17:25:15 GMT
- Title: On the Factual Consistency of Text-based Explainable Recommendation Models
- Authors: Ben Kabongo, Vincent Guigue
- Abstract summary: We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews. We propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, to improve user trust and system transparency. Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence? We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates and focuses on their factual content. Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality. We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations. Across extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems.
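The statement-level precision reported in the abstract (the fraction of a model's atomic explanatory statements that are supported by ground-truth statements extracted from reviews) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `naive_entails` string-match check is a placeholder for the LLM- or NLI-based entailment judges the authors actually use, and all function names here are hypothetical.

```python
from typing import Callable, List


def statement_precision(
    generated: List[str],
    reference: List[str],
    entails: Callable[[str, str], bool],
) -> float:
    """Fraction of generated atomic statements supported by at least
    one reference statement (statement-level precision)."""
    if not generated:
        return 0.0
    supported = sum(
        1 for g in generated if any(entails(r, g) for r in reference)
    )
    return supported / len(generated)


# Placeholder entailment check (exact match after normalization).
# In the paper's setting this would be an NLI model or an LLM judge.
def naive_entails(premise: str, hypothesis: str) -> bool:
    return premise.strip().lower() == hypothesis.strip().lower()


gen = ["the battery lasts two days", "the screen is bright"]
ref = ["The battery lasts two days", "the camera is sharp"]
print(statement_precision(gen, ref, naive_entails))  # 0.5
```

Swapping `naive_entails` for a proper entailment model is what separates this toy from a real factuality metric; the scaffolding (support-counting over atomic statements) stays the same.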
Related papers
- LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals [18.015918696398085]
Concept-based explanations quantify how high-level concepts influence model behavior. Existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. We introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy.
arXiv Detail & Related papers (2026-01-15T18:54:50Z) - Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review [37.98161722413899]
Pull request (PR) review is essential for ensuring software quality, yet it remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real
arXiv Detail & Related papers (2026-01-06T18:49:56Z) - Structured Prompting Enables More Robust Evaluation of Language Models [38.53918044830268]
We present a DSPy+HELM framework that introduces structured prompting methods which elicit reasoning. We find that without structured prompting, HELM underestimates LM performance (by 4% average) and performance estimates vary more across benchmarks. This is the first benchmarking study to systematically integrate structured prompting into an established evaluation framework.
arXiv Detail & Related papers (2025-11-25T20:37:59Z) - FIRE: Faithful Interpretable Recommendation Explanations [2.6499018693213316]
Natural language explanations in recommender systems are often framed as a review generation task. FIRE is a lightweight and interpretable framework that combines SHAP-based feature attribution with structured, prompt-driven language generation. Our results demonstrate that FIRE not only achieves competitive recommendation accuracy but also significantly improves explanation quality along critical dimensions such as alignment, structure, and faithfulness.
arXiv Detail & Related papers (2025-08-07T10:11:02Z) - eX-NIDS: A Framework for Explainable Network Intrusion Detection Leveraging Large Language Models [3.8436076642278745]
This paper introduces eX-NIDS, a framework designed to enhance interpretability in flow-based Network Intrusion Detection Systems (NIDS). In our proposed framework, flows labelled as malicious by NIDS are initially processed through a module called the Prompt Augmenter. This module extracts contextual information and Cyber Threat Intelligence (CTI)-related knowledge from these flows. This enriched, context-specific data is then integrated with an input prompt for an LLM, enabling it to generate detailed explanations and interpretations of why the flow was identified as malicious by NIDS.
arXiv Detail & Related papers (2025-07-22T05:26:21Z) - ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments [23.514446188485838]
We argue for a method of moments evaluation over the space of meaning-preserving prompt perturbations. We show that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit substantial prompt sensitivity.
arXiv Detail & Related papers (2025-05-28T09:40:48Z) - Long-Form Information Alignment Evaluation Beyond Atomic Facts [60.25969380388974]
We introduce MontageLie, a benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency.
arXiv Detail & Related papers (2025-05-21T17:46:38Z) - Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system. It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Unlocking the Potential of Large Language Models for Explainable Recommendations [55.29843710657637]
It remains uncertain what impact replacing the explanation generator with the recently emerging large language models (LLMs) would have.
In this study, we propose LLMXRec, a simple yet effective two-stage explainable recommendation framework.
By adopting several key fine-tuning techniques, controllable and fluent explanations can be well generated.
arXiv Detail & Related papers (2023-12-25T09:09:54Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.