Related papers: When and How Unlabeled Data Provably Improve In-Context Learning

When and How Unlabeled Data Provably Improve In-Context Learning

URL: http://arxiv.org/abs/2506.15329v1
Date: Wed, 18 Jun 2025 10:01:17 GMT
Title: When and How Unlabeled Data Provably Improve In-Context Learning
Authors: Yingcong Li, Xiangyu Chang, Muti Kara, Xiaofeng Liu, Amit Roy-Chowdhury, Samet Oymak,
Abstract summary: In-supervised learning can be effective even when demonstrations have missing or incorrect labels.<n>We show that multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $sum_ige 0 a_i (Xtop X)iXtop y$ with $X$ and $y$ denoting features and partially-observed labels.
Score: 31.201385551730926
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent research shows that in-context learning (ICL) can be effective even when demonstrations have missing or incorrect labels. To shed light on this capability, we examine a canonical setting where the demonstrations are drawn according to a binary Gaussian mixture model (GMM) and a certain fraction of the demonstrations have missing labels. We provide a comprehensive theoretical study to show that: (1) The loss landscape of one-layer linear attention models recover the optimal fully-supervised estimator but completely fail to exploit unlabeled data; (2) In contrast, multilayer or looped transformers can effectively leverage unlabeled data by implicitly constructing estimators of the form $\sum_{i\ge 0} a_i (X^\top X)^iX^\top y$ with $X$ and $y$ denoting features and partially-observed labels (with missing entries set to zero). We characterize the class of polynomials that can be expressed as a function of depth and draw connections to Expectation Maximization, an iterative pseudo-labeling algorithm commonly used in semi-supervised learning. Importantly, the leading polynomial power is exponential in depth, so mild amount of depth/looping suffices. As an application of theory, we propose looping off-the-shelf tabular foundation models to enhance their semi-supervision capabilities. Extensive evaluations on real-world datasets show that our method significantly improves the semisupervised tabular learning performance over the standard single pass inference.

Related papers

Graph-Based Semi-Supervised Segregated Lipschitz Learning [0.21847754147782888]
This paper presents an approach to semi-supervised learning for the classification of data using the Lipschitz Learning on graphs. We develop a graph-based semi-supervised learning framework that leverages the properties of the infinity Laplacian to propagate labels in a dataset where only a few samples are labeled.
arXiv Detail & Related papers (2024-11-05T17:16:56Z)
Dual-Decoupling Learning and Metric-Adaptive Thresholding for Semi-Supervised Multi-Label Learning [81.83013974171364]
Semi-supervised multi-label learning (SSMLL) is a powerful framework for leveraging unlabeled data to reduce the expensive cost of collecting precise multi-label annotations.<n>Unlike semi-supervised learning, one cannot select the most probable label as the pseudo-label in SSMLL due to multiple semantics contained in an instance.<n>We propose a dual-perspective method to generate high-quality pseudo-labels.
arXiv Detail & Related papers (2024-07-26T09:33:53Z)
Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels. Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
arXiv Detail & Related papers (2024-04-26T06:00:27Z)
Learning Sparsity of Representations with Discrete Latent Variables [15.05207849434673]
We propose a sparse deep latent generative model SDLGM to explicitly model degree of sparsity. The resulting sparsity of a representation is not fixed, but fits to the observation itself under the pre-defined restriction. For inference and learning, we develop an amortized variational method based on MC gradient estimator.
arXiv Detail & Related papers (2023-04-03T12:47:18Z)
The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning [32.15608637930748]
We show that there exists a trade-off between the two desiderata so that one may not be able to achieve both simultaneously. We provide analysis using a theoretical data model and show that, while more diverse pre-training data result in more diverse features for different tasks, it puts less emphasis on task-specific features.
arXiv Detail & Related papers (2023-02-28T22:14:33Z)
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity [71.11795737362459]
ViTs with self-attention modules have recently achieved great empirical success in many tasks. However, theoretical learning generalization analysis is mostly noisy and elusive. This paper provides the first theoretical analysis of a shallow ViT for a classification task.
arXiv Detail & Related papers (2023-02-12T22:12:35Z)
Bayesian Graph Contrastive Learning [55.36652660268726]
We propose a novel perspective of graph contrastive learning methods showing random augmentations leads to encoders. Our proposed method represents each node by a distribution in the latent space in contrast to existing techniques which embed each node to a deterministic vector. We show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-12-15T01:45:32Z)
Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK) MAK follows three simple principles: tailness, proximity, and diversity. We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
arXiv Detail & Related papers (2021-11-01T15:09:41Z)
Model-Change Active Learning in Graph-Based Semi-Supervised Learning [5.174023161939957]
"Model Change" active learning quantifies the resulting change by introducing the additional label(s) We consider a family of convex loss functions for which the acquisition function can be efficiently approximated using the Laplace approximation of the posterior distribution.
arXiv Detail & Related papers (2021-10-14T21:47:10Z)
Semi-Supervised Learning with Meta-Gradient [123.26748223837802]
We propose a simple yet effective meta-learning algorithm in semi-supervised learning. We find that the proposed algorithm performs favorably against state-of-the-art methods.
arXiv Detail & Related papers (2020-07-08T08:48:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.