Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information
- URL: http://arxiv.org/abs/2511.13049v2
- Date: Fri, 21 Nov 2025 12:07:39 GMT
- Title: Generalization Bounds for Semi-supervised Matrix Completion with Distributional Side Information
- Authors: Antoine Ledent, Mun Chong Soo, Nong Minh Hieu
- Abstract summary: We study a matrix completion problem where both the ground truth matrix $R$ and the unknown sampling distribution $P$ over observed entries are low-rank. We show that the true generalization error splits into independent error terms corresponding to the estimation of $P$ and of the ground truth matrix $R$ respectively.
- Score: 14.149880038429485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a matrix completion problem where both the ground truth matrix $R$ and the unknown sampling distribution $P$ over observed entries are low-rank matrices, and \textit{share a common subspace}. We assume that a large amount $M$ of \textit{unlabeled} data drawn from the sampling distribution $P$ is available, together with a small amount $N$ of labeled data drawn from the same distribution and noisy estimates of the corresponding ground truth entries. This setting is inspired by recommender systems scenarios where the unlabeled data corresponds to `implicit feedback' (consisting of interactions such as purchase, click, etc.) and the labeled data corresponds to `explicit feedback', consisting of interactions where the user has given an explicit rating to the item. Leveraging powerful results from the theory of low-rank subspace recovery, together with classic generalization bounds for matrix completion models, we show error bounds consisting of a sum of two error terms scaling as $\widetilde{O}\left(\sqrt{\frac{nd}{M}}\right)$ and $\widetilde{O}\left(\sqrt{\frac{dr}{N}}\right)$ respectively, where $d$ is the rank of $P$ and $r$ is the rank of $R$. In synthetic experiments, we confirm that the true generalization error naturally splits into independent error terms corresponding to the estimation of $P$ and of the ground truth matrix $R$ respectively. In real-life experiments on Douban and MovieLens with most explicit ratings removed, we demonstrate that the method can outperform baselines relying only on the explicit ratings, showing that our assumptions provide a valid toy theoretical setting to study the interaction between explicit and implicit feedback in recommender systems.
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - Harmful Overfitting in Sobolev Spaces [49.47221415754556]
We study the generalization behavior of functions in Sobolev spaces $W^{k,p}(\mathbb{R}^d)$ that perfectly fit a noisy training data set. We show that approximately norm-minimizing interpolators, which are canonical solutions selected by smoothness bias, exhibit harmful overfitting. Our proof uses a geometric argument which identifies harmful neighborhoods of the training data using Sobolev inequalities.
arXiv Detail & Related papers (2026-01-31T17:40:56Z) - One-Sided Matrix Completion from Ultra-Sparse Samples [18.28432289831719]
We propose an unbiased estimator that normalizes each nonzero entry of the second moment by its observed frequency. We show that the estimator is unbiased for any $p$ and enjoys low variance. Experiments on both synthetic and real-world data validate our approach.
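A frequency-normalized second-moment estimator of the kind this summary describes can be sketched as follows. This is a generic inverse-frequency sketch under our own assumptions (a Bernoulli($p$) observation mask over i.i.d. standard-normal rows), not the paper's estimator; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 20_000, 8, 0.3           # samples, features, observation probability

X = rng.standard_normal((m, n))
mask = rng.random((m, n)) < p      # Bernoulli observation pattern
Xo = X * mask                      # zero-filled observed matrix

# Naive estimate of the second moment E[x x^T] is biased by the mask:
# diagonals shrink by p, off-diagonals by p^2.
S_naive = Xo.T @ Xo / m

# Frequency-normalized estimate: divide each entry of the raw second
# moment by how often that pair of coordinates was jointly observed.
co_counts = mask.T.astype(float) @ mask.astype(float)
S_hat = np.where(co_counts > 0, (Xo.T @ Xo) / np.maximum(co_counts, 1), 0.0)

S_true = np.eye(n)                 # E[x x^T] for standard normal x
```

Dividing by the joint observation count, rather than by $m$, removes the mask-induced bias entry by entry.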
arXiv Detail & Related papers (2026-01-18T01:33:48Z) - Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [72.69498649272347]
Learning conditional distributions is a central problem in machine learning. We propose a new paradigm that integrates both paired and unpaired data. We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z) - Semidefinite programming relaxations and debiasing for MAXCUT-based clustering [1.9761774213809036]
We consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of two sub-gaussian distributions in $\mathbb{R}^p$. We use semidefinite programming relaxations of an integer quadratic program that is formulated as finding the maximum cut on a graph.
arXiv Detail & Related papers (2024-01-16T03:14:24Z) - Out-Of-Domain Unlabeled Data Improves Generalization [0.7589678255312519]
We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems.
We show that unlabeled samples can be harnessed to narrow the generalization gap.
We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
arXiv Detail & Related papers (2023-09-29T02:00:03Z) - Kernel-Based Tests for Likelihood-Free Hypothesis Testing [21.143798051525646]
Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to \emph{one} of the two classes.
Special cases of this problem are well-known: when $m=1$ it corresponds to binary classification, and when $m \approx n$ it is equivalent to two-sample testing.
In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed.
arXiv Detail & Related papers (2023-08-17T15:24:03Z) - Tackling Combinatorial Distribution Shift: A Matrix Completion Perspective [42.85196869759168]
We study a setting we call combinatorial distribution shift, where (a) under both the test and training distributions, the labels $z$ are determined by pairs of features $(x,y)$, (b) the training distribution has coverage of certain marginal distributions over $x$ and $y$ separately, but (c) the test distribution involves examples from a product distribution over $(x,y)$ that is not covered by the training distribution.
arXiv Detail & Related papers (2023-07-12T21:17:47Z) - How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm? [73.80001705134147]
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm.
The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset.
arXiv Detail & Related papers (2022-10-15T04:11:56Z) - Low-rank Matrix Recovery With Unknown Correspondence [62.634051913953485]
We show that it is possible to recover $M$ via solving a nuclear norm minimization problem under a proper low-rank condition on $M$, with provable non-asymptotic error bound for the recovery of $M$.
Experiments on simulated data, the MovieLens 100K dataset and the Yale B database show that $\text{M}^3\text{O}$ achieves state-of-the-art performance over several baselines and can recover the ground-truth correspondence with high accuracy.
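The nuclear norm minimization this summary refers to can be approximated by iterative singular-value soft-thresholding (the standard soft-impute scheme). The sketch below is a generic illustration of that technique on synthetic data, not the paper's $\text{M}^3\text{O}$ method, and the threshold `tau` and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 40, 2
M_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.5            # ~50% of entries observed

def soft_impute(M_obs, mask, tau=2.0, iters=200):
    """Iterative singular-value soft-thresholding, a standard proxy for
    nuclear norm minimization: refill unobserved entries each round,
    then shrink the singular values of the filled matrix by tau."""
    Z = np.zeros_like(M_obs)
    for _ in range(iters):
        filled = np.where(mask, M_obs, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
    return Z

M_hat = soft_impute(M_true * mask, mask)
rel_err = np.linalg.norm(M_hat - M_true) / np.linalg.norm(M_true)
```

With a rank-2 target and half the entries observed, the shrinkage recovers the matrix up to a small bias proportional to `tau`.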
arXiv Detail & Related papers (2021-10-15T09:27:50Z) - Leveraged Matrix Completion with Noise [84.20092979053119]
We show that we can provably recover an unknown $n \times n$ matrix of rank $r$ from just about $\mathcal{O}(nr\log^2(n))$ entries.
Our proofs are supported by a novel approach that formulates sufficient optimality conditions based on the Golfing Scheme.
arXiv Detail & Related papers (2020-11-11T16:25:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.