Robustly identifying concepts introduced during chat fine-tuning using crosscoders
- URL: http://arxiv.org/abs/2504.02922v1
- Date: Thu, 03 Apr 2025 17:50:24 GMT
- Title: Robustly identifying concepts introduced during chat fine-tuning using crosscoders
- Authors: Julian Minder, Clement Dumas, Caden Juang, Bilal Chughtai, Neel Nanda
- Abstract summary: Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models. We identify two issues, stemming from the crosscoder's L1 training loss, that can misattribute concepts as unique to the fine-tuned model when they really exist in both models. We train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts.
- Score: 1.253890114209776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens through which to interpret them. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues, stemming from the crosscoder's L1 training loss, that can misattribute concepts as unique to the fine-tuned model when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of genuinely chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat tuning modifies language model behavior.
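To make the L1-versus-BatchTopK contrast concrete, here is a minimal sketch of a crosscoder with BatchTopK sparsity. Everything here (names, shapes, the shared-encoder layout) is an illustrative assumption, not the authors' implementation:

```python
import torch

def batch_topk(latents: torch.Tensor, k_per_example: int) -> torch.Tensor:
    """Keep the batch_size * k_per_example largest activations across the whole
    batch and zero the rest. Unlike an L1 penalty, sparsity is enforced by the
    activation rule itself, so latents are not shrunk toward zero."""
    k_total = latents.shape[0] * k_per_example
    threshold = latents.flatten().topk(k_total).values.min()
    return latents * (latents >= threshold)

class Crosscoder(torch.nn.Module):
    """Shared dictionary of latents with separate decoder directions for the
    base and fine-tuned (chat) model, so each latent's presence in each model
    can be compared."""
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.enc_base = torch.nn.Linear(d_model, n_latents, bias=False)
        self.enc_chat = torch.nn.Linear(d_model, n_latents, bias=False)
        self.dec_base = torch.nn.Linear(n_latents, d_model, bias=False)
        self.dec_chat = torch.nn.Linear(n_latents, d_model, bias=False)
        self.k = k

    def forward(self, act_base, act_chat):
        pre = torch.relu(self.enc_base(act_base) + self.enc_chat(act_chat))
        z = batch_topk(pre, self.k)
        return self.dec_base(z), self.dec_chat(z)

def reconstruction_loss(model, act_base, act_chat):
    rec_base, rec_chat = model(act_base, act_chat)
    mse = torch.nn.functional.mse_loss
    # No L1 term: with BatchTopK, sparsity comes from the activation rule.
    return mse(rec_base, act_base) + mse(rec_chat, act_chat)
```

Latent Scaling, as the abstract describes it, would then measure how well each latent's direction explains the base versus the chat activations, flagging latents that look chat-only from their decoder norms but in fact reconstruct base activations too.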
Related papers
- Grokking ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior [25.975757048963413]
Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation. We present ExPLAIND, a unified framework that integrates all three perspectives.
arXiv Detail & Related papers (2025-05-26T14:53:11Z)
- Interpret the Internal States of Recommendation Model with Sparse Autoencoder [26.021277330699963]
RecSAE is an automatic, generalizable probing method for interpreting the internal states of recommendation models.
We train an autoencoder with sparsity constraints to reconstruct internal activations of recommendation models.
We automate the construction of concept dictionaries based on the relationship between latent activations and input item sequences.
arXiv Detail & Related papers (2024-11-09T08:22:31Z)
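A minimal sketch of the sparsity-constrained autoencoder described in the entry above; the names and the L1 coefficient are assumptions for illustration, not RecSAE's actual code:

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """Reconstruct a recommendation model's internal activations through an
    overcomplete latent layer whose activations are kept sparse."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, n_latents)
        self.decoder = torch.nn.Linear(n_latents, d_model)

    def forward(self, activations):
        z = torch.relu(self.encoder(activations))
        return self.decoder(z), z

def sae_loss(model, activations, l1_coeff=1e-3):
    recon, z = model(activations)
    # Reconstruction error plus an L1 penalty enforcing the sparsity constraint.
    return torch.nn.functional.mse_loss(recon, activations) + l1_coeff * z.abs().mean()
```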
- Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework [2.8470354623829577]
We develop a framework based on Concept Bottleneck Models to enforce interpretability of time series Transformers.
We modify the training objective to encourage a model to develop representations similar to predefined interpretable concepts.
We find that the model performance remains mostly unaffected, while the model shows much improved interpretability.
arXiv Detail & Related papers (2024-10-08T14:22:40Z)
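A hedged sketch of the kind of modified training objective the entry above describes: the usual task loss plus a term pulling an intermediate representation toward predefined concept targets. The names and the weighting are assumptions, not the paper's exact formulation:

```python
import torch

def concept_aligned_loss(hidden, concept_targets, task_logits, labels, lam=0.5):
    """Task loss plus an alignment term that encourages the model's hidden
    representation to resemble predefined interpretable concepts."""
    task_loss = torch.nn.functional.cross_entropy(task_logits, labels)
    alignment = torch.nn.functional.mse_loss(hidden, concept_targets)
    return task_loss + lam * alignment
```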
- Linking Robustness and Generalization: A k* Distribution Analysis of Concept Clustering in Latent Space for Vision Models [56.89974470863207]
This article uses the k* Distribution, a local neighborhood analysis method, to examine the learned latent space at the level of individual concepts.
We introduce skewness-based true and approximate metrics for interpreting individual concepts to assess the overall quality of vision models' latent space.
arXiv Detail & Related papers (2024-08-17T01:43:51Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit the conditional probabilities of the data distribution via masking; these conditionals are then used in a Markov chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
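Read literally, the entry above describes a masked model whose conditionals drive a Markov chain; a rough sketch of such a sampling loop follows. `model`, the masking fraction, and the step count are illustrative assumptions:

```python
import torch

def gmlm_sample(model, seq, mask_id, steps=10, frac=0.15):
    """Gibbs-style sampling for a generative masked LM: repeatedly mask a
    random subset of positions and resample them from the model's
    conditional distribution. `model` maps token ids [batch, length] to
    logits [batch, length, vocab]."""
    seq = seq.clone()
    for _ in range(steps):
        mask = torch.rand(seq.shape, device=seq.device) < frac
        logits = model(seq.masked_fill(mask, mask_id))
        resampled = torch.distributions.Categorical(logits=logits).sample()
        seq = torch.where(mask, resampled, seq)
    return seq
```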
- Investigating the Robustness of Modelling Decisions for Few-Shot Cross-Topic Stance Detection: A Preregistered Study [3.9394231697721023]
In this paper, we investigate the robustness of operationalization choices for few-shot stance detection.
We compare stance task definitions (Pro/Con versus Same Side Stance), two LLM architectures (bi-encoding versus cross-encoding), and adding Natural Language Inference knowledge.
Some of our hypotheses and claims from earlier work can be confirmed, while others give more inconsistent results.
arXiv Detail & Related papers (2024-04-05T09:48:00Z)
- Scalable Model Editing via Customized Expert Networks [10.211286961377942]
We introduce Scalable Model Editing via Customized Expert Networks (SCEN).
In the first stage, we train lightweight expert networks individually for each piece of knowledge that needs to be updated.
In the second stage, we train a corresponding indexing neuron for each expert to control the activation state of that expert.
arXiv Detail & Related papers (2024-04-03T12:57:19Z)
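A minimal sketch of the expert-plus-indexing-neuron pattern the entry above describes (illustrative structure only, not the SCEN reference implementation):

```python
import torch

class ExpertWithIndexer(torch.nn.Module):
    """Stage 1 trains a lightweight expert holding one piece of updated
    knowledge; stage 2 trains a single 'indexing neuron' that gates when
    the expert's correction is applied."""
    def __init__(self, d_model: int):
        super().__init__()
        self.expert = torch.nn.Sequential(
            torch.nn.Linear(d_model, d_model // 4),
            torch.nn.ReLU(),
            torch.nn.Linear(d_model // 4, d_model),
        )
        self.indexer = torch.nn.Linear(d_model, 1)  # one neuron per expert

    def forward(self, hidden):
        gate = torch.sigmoid(self.indexer(hidden))  # near 1 only for edited facts
        return hidden + gate * self.expert(hidden)
```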
- Predictive Churn with the Set of Good Models [61.00058053669447]
This paper explores connections between two seemingly unrelated concepts of predictive inconsistency.
The first, known as predictive multiplicity, occurs when models that perform similarly produce conflicting predictions for individual samples.
The second concept, predictive churn, examines the differences in individual predictions before and after model updates.
arXiv Detail & Related papers (2024-02-12T16:15:25Z)
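Predictive churn as defined above has a direct empirical reading; the sketch below (hypothetical names) measures it as the fraction of individual predictions that flip across a model update:

```python
import numpy as np

def predictive_churn(preds_before: np.ndarray, preds_after: np.ndarray) -> float:
    """Fraction of samples whose predicted label changes between two model
    versions, e.g. predictive_churn(old.predict(X), new.predict(X))."""
    return float(np.mean(preds_before != preds_after))
```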
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can stem from biases in data acquisition rather than from the underlying task.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training scheme to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Dual Path Modeling for Semantic Matching by Perceiving Subtle Conflicts [14.563722352134949]
Transformer-based pre-trained models have achieved great improvements in semantic matching.
Existing models still struggle to capture subtle differences.
We propose a novel Dual Path Modeling Framework to enhance the model's ability to perceive subtle differences.
arXiv Detail & Related papers (2023-02-24T09:29:55Z)
- Meaningfully Explaining a Model's Mistakes [16.521189362225996]
We propose a systematic approach, conceptual explanation scores (CES).
CES explains why a classifier makes a mistake on particular test samples in terms of human-understandable concepts.
We also train new models with intentional and known spurious correlations, which CES successfully identifies from a single misclassified test sample.
arXiv Detail & Related papers (2021-06-24T01:49:55Z)
- ModelDiff: Testing-Based DNN Similarity Comparison for Model Reuse Detection [9.106864924968251]
ModelDiff is a testing-based approach to deep learning model similarity comparison.
A study on mobile deep learning apps has shown the feasibility of ModelDiff on real-world models.
arXiv Detail & Related papers (2021-06-11T15:16:18Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Do Generative Models Know Disentanglement? Contrastive Learning is All You Need [59.033559925639075]
We propose an unsupervised and model-agnostic method: Disentanglement via Contrast (DisCo) in the Variation Space.
DisCo achieves the state-of-the-art disentanglement given pretrained non-disentangled generative models, including GAN, VAE, and Flow.
arXiv Detail & Related papers (2021-02-21T08:01:20Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
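Informally, the self-consistency notion above asks that re-encoding a decoded sample recover the original latent. A minimal deterministic sketch (the paper's formulation is probabilistic, and these names are assumptions):

```python
import torch

def self_consistency_loss(encoder, decoder, z):
    """Decode a latent, re-encode the result, and penalize deviation from
    the original latent; `encoder` returns a point estimate here for
    simplicity."""
    z_recovered = encoder(decoder(z))
    return torch.nn.functional.mse_loss(z_recovered, z)
```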
- VisBERT: Hidden-State Visualizations for Transformers [66.86452388524886]
We present VisBERT, a tool for visualizing the contextual token representations within BERT for the task of (multi-hop) Question Answering.
VisBERT enables users to get insights about the model's internal state and to explore its inference steps or potential shortcomings.
arXiv Detail & Related papers (2020-11-09T15:37:43Z)
- Concept Bottleneck Models [79.91795150047804]
State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs".
We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label.
On x-ray grading and bird identification, concept bottleneck models achieve competitive accuracy with standard end-to-end models.
arXiv Detail & Related papers (2020-07-09T07:47:28Z)
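A minimal sketch of the two-stage structure described in the entry above: inputs are first mapped to human-interpretable concepts, and the label is predicted only from those concepts. Layer shapes and names are assumptions, not the paper's architecture:

```python
import torch

class ConceptBottleneck(torch.nn.Module):
    """Input -> predicted concepts (e.g. 'bone spurs present') -> label.
    Because the label depends only on the concepts, editing a concept at
    test time directly changes the downstream prediction."""
    def __init__(self, d_in: int, n_concepts: int, n_labels: int):
        super().__init__()
        self.to_concepts = torch.nn.Linear(d_in, n_concepts)
        self.to_label = torch.nn.Linear(n_concepts, n_labels)

    def forward(self, x):
        concepts = torch.sigmoid(self.to_concepts(x))
        return self.to_label(concepts), concepts
```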
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.