OPORP: One Permutation + One Random Projection
- URL: http://arxiv.org/abs/2302.03505v2
- Date: Tue, 23 May 2023 15:03:53 GMT
- Title: OPORP: One Permutation + One Random Projection
- Authors: Ping Li and Xiaoyun Li
- Abstract summary: OPORP uses a variant of the ``count-sketch'' type of data structures for achieving data reduction/compression.
One crucial step is to normalize the $k$ samples to the unit $l_2$ norm.
With OPORP, two key steps, (i) the normalization and (ii) the fixed-length binning scheme, considerably improve the accuracy of estimating the cosine similarity.
- Score: 37.6593006747285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many
embedding-based retrieval (EBR) applications where the vectors are generated
from trained models, $D=256\sim 1024$ are common. In this paper, OPORP (one
permutation + one random projection) uses a variant of the ``count-sketch''
type of data structures for achieving data reduction/compression. With OPORP,
we first apply a permutation on the data vectors. A random vector $r$ is
generated i.i.d. with moments: $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0,
E(r_i^4)=s$. We multiply (as dot product) $r$ with all permuted data vectors.
Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e.,
sum) the values in each bin to obtain $k$ samples from each data vector. One
crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show
that the estimation variance is essentially: $(s-1)A +
\frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $A\geq 0$ is a
function of the data ($u,v$). This formula reveals several key properties: (1)
We need $s=1$. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in
reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is a substantial
improvement compared with $\frac{1}{k}(1+\rho^2)$, which corresponds to the
un-normalized estimator. We illustrate that by letting $k=1$ in OPORP and
repeating the procedure $m$ times, we exactly recover the work of ``very
sparse random projections'' (VSRP). This immediately leads to a normalized
estimator for VSRP which substantially improves the original estimator of VSRP.
In summary, with OPORP, the two key steps, (i) the normalization and (ii) the
fixed-length binning scheme, considerably improve the accuracy of
estimating the cosine similarity, which is a routine (and crucial) task in
modern embedding-based retrieval (EBR) applications.
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Regression for matrix-valued data via Kronecker products factorization [0.5156484100374059]
We propose an estimation algorithm, termed KRO-PRO-FAC, for estimating the parameters $\{\beta_{1k}\} \subset \mathbb{R}^{p \times q_1}$ and $\{\beta_{2k}\} \subset \mathbb{R}^{p \times q_2}$.
Numerical studies on simulated and real data indicate that our procedure is competitive, in terms of both estimation error and predictive accuracy, compared to other existing methods.
arXiv Detail & Related papers (2024-04-30T02:44:41Z) - Structure Learning in Graphical Models from Indirect Observations [17.521712510832558]
This paper considers learning of the graphical structure of a $p$-dimensional random vector $X \in \mathbb{R}^p$ using both parametric and non-parametric methods.
Under mild conditions, we show that our graph-structure estimator can obtain the correct structure.
arXiv Detail & Related papers (2022-05-06T19:24:44Z) - Weighted-average quantile regression [1.0742675209112622]
We introduce the weighted-average quantile regression framework, $\int_0^1 q_{Y|X}(u)\,\psi(u)\,du = X'\beta$, where $Y$ is a dependent variable.
We develop an estimator of the vector of parameters $\beta$, where $T$ is the size of the available sample.
arXiv Detail & Related papers (2022-03-06T19:06:53Z) - Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $\text{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T - zI_p)^{-1}$ and $A \in \mathcal{M}_p$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt{n})$.
Here, we show that $\|\mathbb{E}[R(z)] - \tilde{R}(z)\|_F$
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Sparse sketches with small inversion bias [79.77110958547695]
Inversion bias arises when averaging estimates of quantities that depend on the inverse covariance.
We develop a framework for analyzing inversion bias, based on our proposed concept of an $(\epsilon,\delta)$-unbiased estimator for random matrices.
We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, the estimator is $(\epsilon,\delta)$-unbiased for $(A^\top A)^{-1}$ with a sketch of size $m = O(d + \sqrt{d}/\epsilon)$.
arXiv Detail & Related papers (2020-11-21T01:33:15Z) - Sharp threshold for alignment of graph databases with Gaussian weights [1.6752182911522522]
We study the fundamental limits for reconstruction in weighted graph (or matrix) database alignment.
We prove that there is a sharp threshold for exact recovery of $\pi^*$.
arXiv Detail & Related papers (2020-10-30T14:42:17Z) - Truncated Linear Regression in High Dimensions [26.41623833920794]
In truncated linear regression, we observe samples $(A_i, y_i)$ whose dependent variable equals $y_i = A_i^{\mathrm{T}} x^* + \eta_i$, where $x^*$ is some fixed unknown vector of interest.
The goal is to recover $x*$ under some favorable conditions on the $A_i$'s and the noise distribution.
We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples.
arXiv Detail & Related papers (2020-07-29T00:31:34Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression, where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.