Similarity of Processing Steps in Vision Model Representations
- URL: http://arxiv.org/abs/2601.21621v1
- Date: Thu, 29 Jan 2026 12:24:28 GMT
- Title: Similarity of Processing Steps in Vision Model Representations
- Authors: Matéo Mahaut, Marco Baroni
- Abstract summary: We study the processes that lead to convergent representations in different models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next.
- Score: 5.816800536560847
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent literature suggests that the bigger the model, the more likely it is to converge to similar, ``universal'' representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations--in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify the distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps that differ most between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.
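The layer-wise comparison of representations described above can be sketched with linear centered kernel alignment (CKA). The abstract does not name the distance measure actually used, so CKA here is an illustrative assumption; the function name and array shapes are likewise hypothetical.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape
    (n_samples, dim); higher values mean more similar representations.
    Invariant to orthogonal transforms and isotropic scaling."""
    X = X - X.mean(axis=0)  # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

Applied to activations of two models on the same image batch, a score per layer pair yields the kind of cross-model distance map the abstract describes.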
Related papers
- When Does Closeness in Distribution Imply Representational Similarity? An Identifiability Perspective [9.578534178372829]
We prove that a small Kullback-Leibler divergence between the model distributions does not guarantee that the corresponding representations are similar. We then define a distributional distance for which closeness implies representational similarity. In synthetic experiments, we find that wider networks learn distributions which are closer with respect to our distance and have more similar representations.
arXiv Detail & Related papers (2025-06-04T09:44:22Z) - Quantifying task-relevant representational similarity using decision variable correlation [6.323629460880451]
We propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). We evaluate this method using monkey V4/IT recordings and models trained on image classification tasks. We find that model-model similarity is comparable to monkey-monkey similarity, whereas model-monkey similarity is consistently lower and, surprisingly, decreases with increasing ImageNet-1k performance.
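A minimal sketch of the idea behind decision variable correlation: assuming each observer produces a scalar per-trial decision variable (e.g. a logit margin), correlate the two observers' variables within each stimulus class and average. The exact estimator in the paper may differ; this is only an illustration.

```python
import numpy as np

def decision_variable_correlation(dv_a, dv_b, labels):
    """Correlate two observers' per-trial decision variables within each
    class, then average across classes (a simplified DVC sketch).
    Conditioning on class removes correlation driven by the stimulus
    category itself."""
    corrs = []
    for c in np.unique(labels):
        mask = labels == c
        corrs.append(np.corrcoef(dv_a[mask], dv_b[mask])[0, 1])
    return float(np.mean(corrs))
```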
arXiv Detail & Related papers (2025-06-02T18:45:05Z) - Connecting Neural Models Latent Geometries with Relative Geodesic Representations [21.71782603770616]
We show that when a latent structure is shared between distinct latent spaces, relative distances between representations can be preserved, up to distortions. We assume that distinct neural models parametrize approximately the same underlying manifold, and introduce a representation based on the pullback metric. We validate our method on model stitching and retrieval tasks, covering autoencoders and discriminative vision foundation models.
arXiv Detail & Related papers (2025-06-02T12:34:55Z) - Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network). After training this network on a small base model using demonstrations, it can be seamlessly integrated with other pre-trained models during inference. We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z) - Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z) - Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
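Fitting a full HMM over metric sequences is library-dependent, but the decoding step can be shown self-contained: the Viterbi algorithm below recovers the most likely sequence of latent training "phases" from per-step emission log-likelihoods and transition probabilities. All parameters here are illustrative, not the paper's.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_start):
    """Most likely latent state sequence for one run, given
    log_emit[t, k] = log p(metrics_t | state k), transition and
    start log-probabilities. Standard max-product dynamic program."""
    T, K = log_emit.shape
    delta = log_start + log_emit[0]        # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)     # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

On a toy loss curve that drops from ~1.0 to ~0.0 halfway through, two Gaussian-emission states cleanly segment training into an early and a late phase.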
arXiv Detail & Related papers (2023-08-18T13:20:08Z) - Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z) - IMACS: Image Model Attribution Comparison Summaries [16.80986701058596]
We introduce IMACS, a method that combines gradient-based model attributions with aggregation and visualization techniques.
IMACS extracts salient input features from an evaluation dataset, clusters them based on similarity, then visualizes differences in model attributions for similar input features.
We show how our technique can uncover behavioral differences caused by domain shift between two models trained on satellite images.
arXiv Detail & Related papers (2022-01-26T21:35:14Z) - Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept [56.46135010588918]
We prove that the widely used class of RNN-Transducer models and segmental models (direct HMM) are equivalent.
It is shown that blank probabilities translate into segment length probabilities and vice versa.
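The stated translation between blank and segment-length probabilities can be illustrated under a simplifying assumption of a constant per-frame blank probability: emitting l-1 blanks followed by a label yields a (truncated) geometric distribution over segment lengths. This is a toy illustration of the correspondence, not the paper's general proof.

```python
def segment_length_probs(p_blank, max_len):
    """Segment-length distribution implied by a constant per-frame blank
    probability: P(len = l) = p_blank**(l-1) * (1 - p_blank), i.e. l-1
    blank frames followed by one label emission (truncated at max_len)."""
    return [(p_blank ** (l - 1)) * (1 - p_blank) for l in range(1, max_len + 1)]
```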
arXiv Detail & Related papers (2021-04-13T11:20:48Z) - Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling [54.94763543386523]
Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors.
We present a novel multi-stage modeling approach where the disentangled factors are first learned using a penalty-based disentangled representation learning method.
Then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables.
arXiv Detail & Related papers (2020-10-25T18:51:15Z) - Similarity Analysis of Contextual Word Representation Models [39.12749165544309]
We use existing and novel similarity measures to gauge the level of localization of information in the deep models.
The analysis reveals that models within the same family are more similar to one another, as may be expected.
Surprisingly, different architectures have rather similar representations, but different individual neurons.
arXiv Detail & Related papers (2020-05-03T19:48:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.