Canonicalizing Multimodal Contrastive Representation Learning
- URL: http://arxiv.org/abs/2602.17584v1
- Date: Thu, 19 Feb 2026 18:09:36 GMT
- Title: Canonicalizing Multimodal Contrastive Representation Learning
- Authors: Sharut Gupta, Sanyam Kansal, Stefanie Jegelka, Phillip Isola, Vikas Garg,
- Abstract summary: We show that across model families such as CLIP, SigLIP, and FLAVA, a geometric relationship exists between their embedding spaces. This finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations.
- Score: 76.15228959754727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As models and data scale, independently trained networks often induce analogous notions of similarity. But matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality, but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models (with encoders $(f, g)$ and $(\widetilde{f},\widetilde{g})$) -- trained on different distributions and with different architectures -- does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global mean shift), i.e., there exists an orthogonal map $Q$ with $Q^\top Q = I$ such that $\widetilde{f}(x)\approx Q f(x)$ for paired images $x$. Strikingly, the same $Q$ simultaneously aligns the text encoders, i.e., $\widetilde{g}(y)\approx Q g(y)$ for texts $y$. Theoretically, we prove that if the multimodal kernel agrees across models on a small anchor set, i.e., $\langle f(x), g(y)\rangle \approx \langle \widetilde{f}(x), \widetilde{g}(y)\rangle$, then the two models must be related by a single orthogonal map $Q$, and the same $Q$ maps images and text across models. More broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations. Our project page: https://canonical-multimodal.github.io/
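The abstract does not spell out how $Q$ is fit in practice. The sketch below shows one natural way to estimate such a map from a small set of paired image embeddings, using the classical orthogonal Procrustes solution on mean-centred features, and then reuses the same $Q$ to carry old text embeddings into the new model's space (the backward-compatible upgrade mentioned above). This is a minimal assumed sketch, not the authors' code; the function and variable names are illustrative.

```python
# Minimal sketch (assumed, not the paper's released code): estimate an orthogonal
# map Q between two models' embedding spaces, up to a global mean shift.
import numpy as np

def estimate_orthogonal_map(Z_old, Z_new):
    """Given paired embeddings (one per row) from the old and new image encoders,
    return (Q, mu_old, mu_new) with Q^T Q = I minimising
    ||(Z_new - mu_new) - (Z_old - mu_old) @ Q.T||_F."""
    mu_old, mu_new = Z_old.mean(axis=0), Z_new.mean(axis=0)
    # Cross-covariance of the centred embeddings, shape (d_new, d_old).
    M = (Z_new - mu_new).T @ (Z_old - mu_old)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    Q = U @ Vt  # closed-form orthogonal Procrustes solution
    return Q, mu_old, mu_new

def carry_forward(Z, Q, mu_old, mu_new):
    """Map stored embeddings from the old model's space into the new model's
    space without re-embedding: z_new ≈ Q (z_old - mu_old) + mu_new."""
    return (Z - mu_old) @ Q.T + mu_new

# Hypothetical usage with a small anchor set of images embedded by both models:
#   Q, mu_old, mu_new = estimate_orthogonal_map(f(anchors), f_tilde(anchors))
# The paper's claim is that this same Q also aligns the text encoders, so stored
# text embeddings g(y) can be reused: carry_forward(g(y), Q, ...) ≈ g_tilde(y).
```

If the claim holds, an index of old embeddings only needs the single matrix $Q$ and the two mean vectors to remain compatible with the upgraded encoders.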
Related papers
- Model-agnostic basis functions for the 2-point correlation function of dark matter in linear theory [0.0]
We find a basis $\mathcal{B}$ that describes $\xi_{\rm lin}(r)$ near the baryon acoustic oscillation feature in a wide class of cosmological models. Using our basis functions in model-agnostic BAO analyses can potentially lead to significant statistical gains.
arXiv Detail & Related papers (2024-10-28T18:00:01Z) - Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis [54.57279006229212]
Information exponent has played an important role in predicting the sample complexity of online gradient descent. In this work, we show that by considering both second- and higher-order terms, we can first learn the relevant space using the second-order terms. The overall sample complexity of online SGD is $\tilde{O}(d\,P^{L-1})$.
arXiv Detail & Related papers (2024-10-13T00:14:08Z) - Bridging the Gap Between Approximation and Learning via Optimal Approximation by ReLU MLPs of Maximal Regularity [8.28720658988688]
We show that any $(L,\alpha)$-Hölder function can be uniformly approximated by a ReLU MLP of width $\mathcal{O}(dn^{d/\alpha})$ and depth $\mathcal{O}(\log(d))$.
arXiv Detail & Related papers (2024-09-18T22:05:07Z) - Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations [40.77319247558742]
We study the computational complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure.
We prove that a large subset of $f_*$ can be efficiently learned by gradient training of a two-layer neural network.
arXiv Detail & Related papers (2024-06-17T17:59:17Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle\boldsymbol{x},\boldsymbol{\theta}\rangle\right)$. We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents.
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Simplifying and Understanding State Space Models with Diagonal Linear
RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs.
We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously-proposed SSMs.
We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
arXiv Detail & Related papers (2022-12-01T18:53:06Z) - Distributed Saddle-Point Problems Under Similarity [173.19083235638104]
We show lower bounds on the number of communication rounds required to reach a given suboptimality $\epsilon>0$ over master/workers networks.
We then propose algorithms matching the lower bounds for either type of network (up to logarithmic factors).
We assess effectiveness of the proposed algorithms on a robust logistic regression problem.
arXiv Detail & Related papers (2021-07-22T14:25:16Z) - Model Selection with Near Optimal Rates for Reinforcement Learning with
General Model Classes [27.361399036211694]
We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem.
In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels.
We show that \texttt{ARL-GEN} obtains a regret of $\tilde{\mathcal{O}}\big(d_{\mathcal{E}^*} H^2 + \sqrt{d_{\mathcal{E}^*}\, \mathbb{M}^* H^2 T}\big)$.
arXiv Detail & Related papers (2021-07-13T05:00:38Z) - Categorical Representation Learning: Morphism is All You Need [0.0]
We provide a construction for categorical representation learning and introduce the foundations of "$\textit{categorifier}$".
Every object in a dataset $\mathcal{S}$ can be represented as a vector in $\mathbb{R}^n$ by an $\textit{encoding map}$ $E: \mathcal{O}bj(\mathcal{S})\to\mathbb{R}^n$.
As a proof of concept, we provide an example of a text translator equipped with our technology, showing that our categorical learning model outperforms the
arXiv Detail & Related papers (2021-03-26T23:47:15Z) - Improving Robustness and Generality of NLP Models Using Disentangled
Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z) - Few-Shot Learning via Learning the Representation, Provably [115.7367053639605]
This paper studies few-shot learning via representation learning.
One uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task.
arXiv Detail & Related papers (2020-02-21T17:30:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.