Related papers: Interpret the Internal States of Recommendation Model with Sparse Autoencoder

Interpret the Internal States of Recommendation Model with Sparse Autoencoder

URL: http://arxiv.org/abs/2411.06112v2
Date: Mon, 14 Jul 2025 09:07:27 GMT
Title: Interpret the Internal States of Recommendation Model with Sparse Autoencoder
Authors: Jiayin Wang, Xiaoyu Zhang, Weizhi Ma, Zhiqiang Guo, Min Zhang,
Abstract summary: RecSAE is an automated and generalizable probing framework that interprets Recommenders with Sparse AutoEncoder.<n>It extracts interpretable latents from the internal states of recommendation models and links them to semantic concepts for interpretation.<n> RecSAE does not alter original models during interpretation and also enables targeted de-biasing to models based on interpreted results.
Score: 28.234859617081295
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recommendation model interpretation aims to reveal models' calculation process, enhancing their transparency, interpretability, and trustworthiness by clarifying the relationships between inputs, model activations, and outputs. However, the complex, often opaque nature of deep learning models complicates interpretation, and most existing methods are tailored to specific model architectures, limiting their generalizability across different types of recommendation models. To address these challenges, we propose RecSAE, an automated and generalizable probing framework that interprets Recommenders with Sparse AutoEncoder. It extracts interpretable latents from the internal states of recommendation models and links them to semantic concepts for interpretation. RecSAE does not alter original models during interpretation and also enables targeted de-biasing to models based on interpreted results. Specifically, RecSAE operates in three steps: First, it probes activations before the prediction layer to capture internal representations. Next, the RecSAE module is trained on these activations with a larger latent space and sparsity constraints, making the RecSAE latents more mono-semantic than the original model activations. Thirdly, RecSAE utilizes a language model to construct concept descriptions with confidence scores based on the relationships between latent activations and recommendation outputs. Experiments on three types of models (general, graph-based, and sequential) with three widely used datasets demonstrate the effectiveness and generalization of RecSAE framework. The interpreted concepts are further validated by human experts, showing strong alignment with human perception. Overall, RecSAE serves as a novel step in both model-level interpretations to various types of recommenders without affecting their functions and offering the potential for targeted tuning of models.

Related papers

LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs)<n>It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code.<n> LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z)
R3: Robust Rubric-Agnostic Reward Models [17.958272227199746]
$shortmethodname$ is a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments.<n>Our models, data, and code are available as open source at https://github.com/rubricreward/r3.
arXiv Detail & Related papers (2025-05-19T17:29:03Z)
LatentQA: Teaching LLMs to Decode Activations Into Natural Language [72.87064562349742]
We introduce LatentQA, the task of answering open-ended questions about model activations in natural language. We propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations.
arXiv Detail & Related papers (2024-12-11T18:59:33Z)
Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort [31.992947353231564]
Concept Bottleneck Models (CBMs) can provide a principled way of disclosing and guiding model behaviors through human-understandable concepts. We propose a novel framework designed to exploit pre-trained models while being immune to these biases, thereby reducing vulnerability to spurious correlations. We evaluate the proposed method on multiple datasets, and the results demonstrate its effectiveness in reducing model reliance on spurious correlations while preserving its interpretability.
arXiv Detail & Related papers (2024-07-12T03:07:28Z)
Self-supervised Interpretable Concept-based Models for Text Classification [9.340843984411137]
This paper proposes a self-supervised Interpretable Concept Embedding Models (ICEMs) We leverage the generalization abilities of Large-Language Models to predict the concepts labels in a self-supervised way. ICEMs can be trained in a self-supervised way achieving similar performance to fully supervised concept-based models and end-to-end black-box ones.
arXiv Detail & Related papers (2024-06-20T14:04:53Z)
Understanding Before Recommendation: Semantic Aspect-Aware Review Exploitation via Large Language Models [53.337728969143086]
Recommendation systems harness user-item interactions like clicks and reviews to learn their representations. Previous studies improve recommendation accuracy and interpretability by modeling user preferences across various aspects and intents. We introduce a chain-based prompting approach to uncover semantic aspect-aware interactions.
arXiv Detail & Related papers (2023-12-26T15:44:09Z)
RecExplainer: Aligning Large Language Models for Explaining Recommendation Models [50.74181089742969]
Large language models (LLMs) have demonstrated remarkable intelligence in understanding, reasoning, and instruction following. This paper presents the initial exploration of using LLMs as surrogate models to explain black-box recommender models. To facilitate an effective alignment, we introduce three methods: behavior alignment, intention alignment, and hybrid alignment.
arXiv Detail & Related papers (2023-11-18T03:05:43Z)
Interpreting and Controlling Vision Foundation Models via Text Explanations [45.30541722925515]
We present a framework for interpreting vision transformer's latent tokens with natural language. Our approach enables understanding of model visual reasoning procedure without needing additional model training or data collection.
arXiv Detail & Related papers (2023-10-16T17:12:06Z)
Evaluating and Explaining Large Language Models for Code Using Syntactic Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes. We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-re (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
Interpretable Sentence Representation with Variational Autoencoders and Attention [0.685316573653194]
We develop methods to enhance the interpretability of recent representation learning techniques in natural language processing (NLP) We leverage Variational Autoencoders (VAEs) due to their efficiency in relating observations to latent generative factors. We build two models with inductive bias to separate information in latent representations into understandable concepts without annotated data.
arXiv Detail & Related papers (2023-05-04T13:16:15Z)
Explaining Language Models' Predictions with High-Impact Concepts [11.47612457613113]
We propose a complete framework for extending concept-based interpretability methods to NLP. We optimize for features whose existence causes the output predictions to change substantially. Our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
arXiv Detail & Related papers (2023-05-03T14:48:27Z)
ProtoVAE: A Trustworthy Self-Explainable Prototypical Variational Model [18.537838366377915]
ProtoVAE is a variational autoencoder-based framework that learns class-specific prototypes in an end-to-end manner. It enforces trustworthiness and diversity by regularizing the representation space and introducing an orthonormality constraint.
arXiv Detail & Related papers (2022-10-15T00:42:13Z)
Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews. We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
Combining Discrete Choice Models and Neural Networks through Embeddings: Formulation, Interpretability and Performance [10.57079240576682]
This study proposes a novel approach that combines theory and data-driven choice models using Artificial Neural Networks (ANNs) In particular, we use continuous vector representations, called embeddings, for encoding categorical or discrete explanatory variables. Our models deliver state-of-the-art predictive performance, outperforming existing ANN-based models while drastically reducing the number of required network parameters.
arXiv Detail & Related papers (2021-09-24T15:55:31Z)
InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via Intermediary Latents [60.785317191131284]
We introduce a simple and effective method for learning VAEs with controllable biases by using an intermediary set of latent variables. In particular, it allows us to impose desired properties like sparsity or clustering on learned representations. We show that this, in turn, allows InteL-VAEs to learn both better generative models and representations.
arXiv Detail & Related papers (2021-06-25T16:34:05Z)
Model Learning with Personalized Interpretability Estimation (ML-PIE) [2.862606936691229]
High-stakes applications require AI-generated models to be interpretable. Current algorithms for the synthesis of potentially interpretable models rely on objectives or regularization terms. We propose an approach for the synthesis of models that are tailored to the user.
arXiv Detail & Related papers (2021-04-13T09:47:48Z)
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts. In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images. We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency. We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions. We show that kNN representations are effective at uncovering learned spurious associations. Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
arXiv Detail & Related papers (2020-10-18T16:55:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.