On the Emergence of Linear Analogies in Word Embeddings
- URL: http://arxiv.org/abs/2505.18651v1
- Date: Sat, 24 May 2025 11:42:26 GMT
- Title: On the Emergence of Linear Analogies in Word Embeddings
- Authors: Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart,
- Abstract summary: Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora.<n>We introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions.
- Score: 5.440589713820591
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$, (ii) strengthens and then saturates as more eigenvectors of $M (i, j)$, which controls the dimension of the embeddings, are included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
Related papers
- Simple Convergence Proof of Adam From a Sign-like Descent Perspective [58.89890024903816]
We show that Adam achieves the optimal rate of $cal O(frac1Ts14)$ rather than the previous $cal O(fracln TTs14)$.<n>Our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence.
arXiv Detail & Related papers (2025-07-08T13:19:26Z) - Bivariate Matrix-valued Linear Regression (BMLR): Finite-sample performance under Identifiability and Sparsity Assumptions [0.0]
We study the estimation of parameters in a matrix-valued linear regression model, where the $T$ responses $(Y_t)_t=1T in mathbbRn times p$ and predictors $(X_t)_t=1T in mathbbRm times q$.<n>We propose explicit optimization-free estimators and establish non-asymptotic convergence rates to quantify their performance.
arXiv Detail & Related papers (2024-12-23T18:03:34Z) - Learnable Similarity and Dissimilarity Guided Symmetric Non-Negative Matrix Factorization [18.53944578996308]
We construct a weighted $k$-NN graph with learnable weight that reflects the reliability of each $k$-th NN.<n>To obtain a discriminative similarity matrix, we introduce a dissimilarity matrix with a dual structure of the similarity matrix.<n>An efficient alternative optimization algorithm is designed to solve the proposed model.
arXiv Detail & Related papers (2024-12-05T11:32:53Z) - Analysing heavy-tail properties of Stochastic Gradient Descent by means of Stochastic Recurrence Equations [0.0]
In recent works, it has been observed that heavy tail properties of Gradient Descent (SGD) can be studied in the probabilistic framework of recursions.
We will answer several open questions of the quoted paper and extend their results by applying the theory of irreducibleproximal (i-p) matrices.
arXiv Detail & Related papers (2024-03-20T13:39:19Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $Omega(dkstar/2)$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Repeated Observations for Classification [0.2676349883103404]
We study the problem nonparametric classification with repeated observations.
In the analysis, we investigate particular models like robust detection by nominal densities, prototype classification, linear transformation, linear classification, scaling.
arXiv Detail & Related papers (2023-07-19T10:50:36Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - Near-optimal fitting of ellipsoids to random points [68.12685213894112]
A basic problem of fitting an ellipsoid to random points has connections to low-rank matrix decompositions, independent component analysis, and principal component analysis.
We resolve this conjecture up to logarithmic factors by constructing a fitting ellipsoid for some $n = Omega(, d2/mathrmpolylog(d),)$.
Our proof demonstrates feasibility of the least squares construction of Saunderson et al. using a convenient decomposition of a certain non-standard random matrix.
arXiv Detail & Related papers (2022-08-19T18:00:34Z) - On the Equivalence of Causal Models: A Category-Theoretic Approach [0.0]
We develop a criterion for determining the equivalence of causal models having different but homomorphic directed acyclic graphs over discrete variables.
The equivalence of causal models is then defined in terms of a natural transformation or isomorphism between two such functors.
It is shown that when one model is a $Phi$-abstraction of another, the intervention of the former can be consistently translated into that of the latter.
arXiv Detail & Related papers (2022-01-18T13:43:06Z) - When Random Tensors meet Random Matrices [50.568841545067144]
This paper studies asymmetric order-$d$ spiked tensor models with Gaussian noise.
We show that the analysis of the considered model boils down to the analysis of an equivalent spiked symmetric textitblock-wise random matrix.
arXiv Detail & Related papers (2021-12-23T04:05:01Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $texttr(AR(z))$, for $R(z) = (frac1nXXT- zI_p)-1$ and $Ain mathcal M_p$ deterministic, have a standard deviation of order $O(|A|_* / sqrt n)$.
Here, we show that $|mathbb E[R(z)] - tilde R(z)|_F
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Stochastic behavior of outcome of Schur-Weyl duality measurement [45.41082277680607]
We focus on the measurement defined by the decomposition based on Schur-Weyl duality on $n$ qubits.
We derive various types of distribution including a kind of central limit when $n$ goes to infinity.
arXiv Detail & Related papers (2021-04-26T15:03:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.