DINO as a von Mises-Fisher mixture model
- URL: http://arxiv.org/abs/2405.10939v1
- Date: Fri, 17 May 2024 17:49:45 GMT
- Title: DINO as a von Mises-Fisher mixture model
- Authors: Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten
- Abstract summary: We show that DINO can be interpreted as a mixture model of von Mises-Fisher components.
We propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities.
We show that the added flexibility of the mixture model is beneficial in terms of better image representations.
- Score: 15.524425102344784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method based on a cross-entropy loss between $K$-dimensional probability vectors, obtained by applying a softmax function to the dot product between representations and learnt prototypes. Given that the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is also stable for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs iBOT and thereby show the relevance of our proposed modification also for other methods derived from DINO.
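The assignment step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the temperature `tau`, the equal mixing weights, and the power-series evaluation of the Bessel function are all assumptions made for this example, and DINO's centering and teacher-student machinery are left out. The sketch takes the precision of component $k$ to be the prototype norm divided by the temperature and adds the log of the von Mises-Fisher normalization constant $C_p(\kappa) = \kappa^{p/2-1} / ((2\pi)^{p/2} I_{p/2-1}(\kappa))$ to each logit before the softmax.

```python
import math


def log_bessel_i(v, x, terms=200):
    """log of the modified Bessel function I_v(x), x > 0, via its power series.

    Illustrative only: adequate for moderate x, not tuned for very large
    orders or arguments.
    """
    logs = [
        (2 * m + v) * math.log(x / 2.0)
        - math.lgamma(m + 1)
        - math.lgamma(m + v + 1)
        for m in range(terms)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def log_vmf_normalizer(kappa, p):
    """log C_p(kappa) for a von Mises-Fisher density on the unit sphere in R^p."""
    v = p / 2.0 - 1.0
    return (
        v * math.log(kappa)
        - (p / 2.0) * math.log(2.0 * math.pi)
        - log_bessel_i(v, kappa)
    )


def l2_normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assignment_probs(y, prototypes, tau=0.1, vmf=False):
    """Cluster assignment probabilities for one L2-normalized representation y.

    vmf=False: softmax over <w_k, y> / tau (DINO-style logits).
    vmf=True:  additionally adds log C_p(||w_k|| / tau) to each logit
               (assumed form of the DINO-vMF correction, equal mixing weights).
    """
    p = len(y)
    logits = []
    for w in prototypes:
        logit = sum(wi * yi for wi, yi in zip(w, y)) / tau
        if vmf:
            kappa = math.sqrt(sum(wi * wi for wi in w)) / tau
            logit += log_vmf_normalizer(kappa, p)
        logits.append(logit)
    return softmax(logits)


# Hypothetical toy data: one representation and two unnormalized prototypes.
y = l2_normalize([1.0, 0.0, 0.0, 0.0])
prototypes = [[1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
dino = assignment_probs(y, prototypes, vmf=False)
dino_vmf = assignment_probs(y, prototypes, vmf=True)
```

When every prototype has the same norm, the added term is identical across components and cancels in the softmax, which matches the abstract's remark that DINO assumes equal precision when the prototypes are $L^2$-normalized. With unnormalized prototypes the term differs per component, so a large prototype norm no longer dominates the assignment through the dot product alone.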
Related papers
- VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations [49.1574468325115]
We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations.
The proposed approach employs an encoder to reduce the high-dimensional input $y(\bm{x})$ to a lower-dimensional latent space, $\bm{\mu}_{\bm{\phi}_y}$.
A fully connected neural network is used to map $\bm{\mu}_{\bm{\phi}_y}$ to the latent space, $\bm{\mu}_{\bm{\phi}_h}$, of the P
arXiv Detail & Related papers (2025-08-05T18:37:32Z) - MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs.
We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.
Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z) - Diffusion models for probabilistic programming [56.47577824219207]
Diffusion Model Variational Inference (DMVI) is a novel method for automated approximate inference in probabilistic programming languages (PPLs).
DMVI is easy to implement, allows hassle-free inference in PPLs without the drawbacks of, e.g., variational inference using normalizing flows, and does not impose any constraints on the underlying neural network model.
arXiv Detail & Related papers (2023-11-01T12:17:05Z) - NPEFF: Non-Negative Per-Example Fisher Factorization [52.44573961263344]
We introduce a novel interpretability method called NPEFF that is readily applicable to any end-to-end differentiable model.
We demonstrate that NPEFF has interpretable tunings through experiments on language and vision models.
arXiv Detail & Related papers (2023-10-07T02:02:45Z) - Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative Models [49.81937966106691]
We develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models.
In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach.
arXiv Detail & Related papers (2023-06-15T16:30:08Z) - ELODI: Ensemble Logit Difference Inhibition for Positive-Congruent Training [110.52785254565518]
Existing methods to reduce the negative flip rate (NFR) either do so at the expense of overall accuracy by forcing a new model to imitate the old models, or use ensembles.
We analyze the role of ensembles in reducing NFR and observe that they remove negative flips that are typically not close to the decision boundary.
We present a method, called Ensemble Logit Difference Inhibition (ELODI), to train a classification system that achieves paragon performance in both error rate and NFR.
arXiv Detail & Related papers (2022-05-12T17:59:56Z) - Latent Time Neural Ordinary Differential Equations [0.2538209532048866]
We propose a novel approach to model uncertainty in NODE by considering a distribution over the end-time $T$ of the ODE solver.
We also propose adaptive latent time NODE (ALT-NODE), which allows each data point to have a distinct posterior distribution over end-times.
We demonstrate the effectiveness of the proposed approaches in modelling uncertainty and robustness through experiments on synthetic and several real-world image classification datasets.
arXiv Detail & Related papers (2021-12-23T17:31:47Z) - Improving Robustness and Uncertainty Modelling in Neural Ordinary Differential Equations [0.2538209532048866]
We propose a novel approach to model uncertainty in NODE by considering a distribution over the end-time $T$ of the ODE solver.
We also propose adaptive latent time NODE (ALT-NODE), which allows each data point to have a distinct posterior distribution over end-times.
We demonstrate the effectiveness of the proposed approaches in modelling uncertainty and robustness through experiments on synthetic and several real-world image classification datasets.
arXiv Detail & Related papers (2021-12-23T16:56:10Z) - Exponentially Tilted Gaussian Prior for Variational Autoencoder [3.52359746858894]
Recent studies show that probabilistic generative models can perform poorly on this task.
We propose the exponentially tilted Gaussian prior distribution for the Variational Autoencoder (VAE).
We show that our model produces high-quality image samples that are crisper than those of a standard Gaussian VAE.
arXiv Detail & Related papers (2021-11-30T18:28:19Z) - Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z) - Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability [25.543231171094384]
In pursuit of explainability, we develop generative models for sequential data.
We combine modern neural networks (normalizing flows) and traditional generative models (hidden Markov models - HMMs).
The proposed generative models can compute the likelihood of the data and are hence directly suitable for a maximum-likelihood (ML) classification approach.
arXiv Detail & Related papers (2021-07-01T20:10:55Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.