Related papers: Can Interpretation Predict Behavior on Unseen Data?

Can Interpretation Predict Behavior on Unseen Data?

URL: http://arxiv.org/abs/2507.06445v1
Date: Tue, 08 Jul 2025 23:07:33 GMT
Title: Can Interpretation Predict Behavior on Unseen Data?
Authors: Victoria R. Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra,
Abstract summary: Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms.<n>This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution model behavior.
Score: 11.280404893713213
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD data -- even when the rule's implementation does not rely on these hierarchical patterns, according to ablation tests. Our findings offer a proof-of-concept to motivate further interpretability work on predicting unseen model behavior.

Related papers

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.<n>We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z)
Towards Generalizable and Interpretable Motion Prediction: A Deep Variational Bayes Approach [54.429396802848224]
This paper proposes an interpretable generative model for motion prediction with robust generalizability to out-of-distribution cases. For interpretability, the model achieves the target-driven motion prediction by estimating the spatial distribution of long-term destinations. Experiments on motion prediction datasets validate that the fitted model can be interpretable and generalizable.
arXiv Detail & Related papers (2024-03-10T04:16:04Z)
Predictive Churn with the Set of Good Models [61.00058053669447]
This paper explores connections between two seemingly unrelated concepts of predictive inconsistency.<n>The first, known as predictive multiplicity, occurs when models that perform similarly produce conflicting predictions for individual samples.<n>The second concept, predictive churn, examines the differences in individual predictions before and after model updates.
arXiv Detail & Related papers (2024-02-12T16:15:25Z)
Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models [3.9052860539161918]
We propose a simple method for measuring a scale of models' reliance on any identified spurious feature. We assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA) We find that while existing debiasing methods can mitigate reliance on a chosen spurious feature, the OOD performance gains of these methods can not be explained by mitigated reliance on biased features.
arXiv Detail & Related papers (2023-05-11T14:35:00Z)
Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities [17.082695183953486]
A popular workaround is to train a robust model by re-weighting training examples based on a secondary biased model. Here, the underlying assumption is that the biased model resorts to shortcut features. We introduce a fine-tuning strategy that incorporates the similarity between the main and biased model attribution scores in a Product of Experts loss function.
arXiv Detail & Related papers (2023-02-06T15:21:41Z)
How robust are pre-trained models to distribution shift? [82.08946007821184]
We show how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based models (AE) We develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation.
arXiv Detail & Related papers (2022-06-17T16:18:28Z)
Pathologies of Pre-trained Language Models in Few-shot Fine-tuning [50.3686606679048]
We show that pre-trained language models with few examples show strong prediction bias across labels. Although few-shot fine-tuning can mitigate the prediction bias, our analysis shows models gain performance improvement by capturing non-task-related features. These observations alert that pursuing model performance with fewer examples may incur pathological prediction behavior.
arXiv Detail & Related papers (2022-04-17T15:55:18Z)
General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews. We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
Understanding Out-of-distribution: A Perspective of Data Dynamics [5.811774625668462]
This paper explores how data dynamics in training models can be used to understand the fundamental differences between OOD and in-distribution samples. We found that syntactic characteristics of the data samples that the model consistently predicts incorrectly in both OOD and in-distribution cases directly contradict each.
arXiv Detail & Related papers (2021-11-29T17:34:38Z)
Evaluation metrics for behaviour modeling [2.616915680939834]
We propose and investigate metrics for evaluating and comparing generative models of behavior learned using imitation learning. These criteria look at longer temporal relationships in behavior, are relevant if behavior has some properties that are inherently unpredictable, and highlight biases in the overall distribution of behaviors produced by the model. We show that the proposed metrics correspond with biologists' intuition about behavior, and allow us to evaluate models, understand their biases, and enable us to propose new research directions.
arXiv Detail & Related papers (2020-07-23T23:47:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.