STUNT: Few-shot Tabular Learning with Self-generated Tasks from
Unlabeled Tables
- URL: http://arxiv.org/abs/2303.00918v1
- Date: Thu, 2 Mar 2023 02:37:54 GMT
- Title: STUNT: Few-shot Tabular Learning with Self-generated Tasks from
Unlabeled Tables
- Authors: Jaehyun Nam, Jihoon Tack, Kyungmin Lee, Hankook Lee, Jinwoo Shin
- Abstract summary: We propose a framework for few-shot semi-supervised tabular learning, coined Self-generated Tasks from UNlabeled Tables (STUNT).
Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label.
We then employ a meta-learning scheme to learn generalizable knowledge with the constructed tasks.
- Score: 64.0903766169603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from few labeled tabular samples is often an essential requirement
for industrial machine learning applications, as many kinds of tabular data
suffer from high annotation costs or difficulties in collecting new
samples for novel tasks. Despite its importance, this problem remains
under-explored in the field of tabular learning, and existing few-shot learning
schemes from other domains are not straightforward to apply, mainly due to the
heterogeneous characteristics of tabular data. In this paper, we propose a
simple yet effective framework for few-shot semi-supervised tabular learning,
coined Self-generated Tasks from UNlabeled Tables (STUNT). Our key idea is to
self-generate diverse few-shot tasks by treating randomly chosen columns as a
target label. We then employ a meta-learning scheme to learn generalizable
knowledge with the constructed tasks. Moreover, we introduce an unsupervised
validation scheme for hyperparameter search (and early stopping) by generating
a pseudo-validation set using STUNT from unlabeled data. Our experimental
results demonstrate that our simple framework brings significant performance
gain under various tabular few-shot learning benchmarks, compared to prior
semi- and self-supervised baselines. Code is available at
https://github.com/jaehyun513/STUNT.
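As a rough sketch of the key idea (treating a randomly chosen column as a target label), the following toy code self-generates one N-way K-shot task from an unlabeled numeric table. This is an illustration only, not the authors' implementation (see the linked repository for that): the quantile-binning pseudo-labeling rule and all function names here are assumptions.

```python
import numpy as np

def generate_stunt_task(X, n_way=3, k_shot=2, n_query=2, rng=None):
    """Self-generate one few-shot task from an unlabeled table:
    pick a random column, bin its values into n_way pseudo-classes,
    and use the remaining columns as features."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    col = rng.integers(d)                      # random column acts as the pseudo-label
    # discretize that column into n_way pseudo-classes by quantile binning
    edges = np.quantile(X[:, col], np.linspace(0, 1, n_way + 1)[1:-1])
    y = np.digitize(X[:, col], edges)
    feats = np.delete(X, col, axis=1)          # drop the pseudo-label column
    support_X, support_y, query_X, query_y = [], [], [], []
    for c in range(n_way):
        idx = rng.permutation(np.flatnonzero(y == c))[: k_shot + n_query]
        support_X.append(feats[idx[:k_shot]]); support_y += [c] * k_shot
        query_X.append(feats[idx[k_shot:]]);   query_y += [c] * n_query
    return (np.vstack(support_X), np.array(support_y),
            np.vstack(query_X), np.array(query_y))
```

A meta-learner would then be trained episodically on many such tasks, each with a freshly sampled pseudo-label column.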
Related papers
- Tabular Transfer Learning via Prompting LLMs [52.96022335067357]
We propose a novel framework, Prompt to Transfer (P2T), that utilizes unlabeled (or heterogeneous) source data with large language models (LLMs).
P2T identifies a column feature in a source dataset that is strongly correlated with a target task feature to create examples relevant to the target task, thus creating pseudo-demonstrations for prompts.
arXiv Detail & Related papers (2024-08-09T11:30:52Z) - Elephants Never Forget: Testing Language Models for Memorization of
Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
We introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z) - An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models [55.01592097059969]
Supervised finetuning on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities of large language models.
Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool.
We propose using experimental design to circumvent the computational bottlenecks of active learning.
arXiv Detail & Related papers (2024-01-12T16:56:54Z) - Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
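To illustrate how a prompted model could slot into a boosting pipeline, here is a generic AdaBoost-style loop; `weak_fit` is a placeholder for any weighted weak learner (in the paper's setting, a prompted LLM), and this sketch is not the paper's exact algorithm.

```python
import numpy as np

def boost(weak_fit, X, y, rounds=10):
    """AdaBoost-style loop for binary labels in {-1, +1}.
    `weak_fit(X, y, w)` returns a predict(X) -> {-1, +1} callable."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform example weights
    ensemble = []
    for _ in range(rounds):
        h = weak_fit(X, y, w)                    # fit weak learner on weighted data
        pred = h(X)
        err = np.clip(w @ (pred != y), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # learner weight from its error
        w *= np.exp(-alpha * y * pred)           # upweight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, h))
    return lambda Z: np.sign(sum(a * h(Z) for a, h in ensemble))
```

The interesting part in the paper's setting is that the weighted examples are surfaced to the LLM via prompting rather than gradient updates.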
arXiv Detail & Related papers (2023-06-25T02:39:19Z) - Deep Active Learning with Contrastive Learning Under Realistic Data Pool
Assumptions [2.578242050187029]
Active learning aims to identify the most informative data from an unlabeled data pool that enables a model to reach the desired accuracy rapidly.
Most existing active learning methods have been evaluated in an ideal setting where only samples relevant to the target task exist in an unlabeled data pool.
We introduce new active learning benchmarks that include ambiguous, task-irrelevant out-of-distribution samples as well as in-distribution samples.
arXiv Detail & Related papers (2023-03-25T10:46:10Z) - An Embarrassingly Simple Approach to Semi-Supervised Few-Shot Learning [58.59343434538218]
We propose a simple but quite effective approach to predict accurate negative pseudo-labels of unlabeled data from an indirect learning perspective.
Our approach can be implemented in just a few lines of code using only off-the-shelf operations.
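The summary does not spell out which off-the-shelf operations are used, but one plausible instantiation of negative pseudo-labels is to mark each unlabeled sample's least-likely classes as "not this class" and penalize probability mass on them; the sketch below is hypothetical, not the paper's method.

```python
import numpy as np

def negative_pseudo_labels(probs, k=2):
    """For each unlabeled sample, return the k classes with the lowest
    predicted probability as negative pseudo-labels ('not this class')."""
    return np.argsort(probs, axis=1)[:, :k]

def complementary_loss(probs, neg_labels, eps=1e-12):
    """Average of -log(1 - p(negative class)): pushes probability
    away from the negative pseudo-labels."""
    p_neg = np.take_along_axis(probs, neg_labels, axis=1)
    return -np.log(1.0 - p_neg + eps).mean()
```

Because the model only has to be right about what a sample is *not*, negative pseudo-labels are typically more reliable than positive ones early in training.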
arXiv Detail & Related papers (2022-09-28T02:11:34Z) - Using Self-Supervised Pretext Tasks for Active Learning [7.214674613451605]
We propose a novel active learning approach that utilizes self-supervised pretext tasks and a unique data sampler to select data that are both difficult and representative.
The pretext task learner is trained on the unlabeled set, and the unlabeled data are sorted and grouped into batches by their pretext task losses.
In each iteration, the main task model is used to sample the most uncertain data in a batch to be annotated.
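The selection procedure described above (sort the pool by pretext-task loss, group into batches, then pick the most uncertain points per batch under the main task model) can be sketched as follows; the function name and batching details are assumptions.

```python
import numpy as np

def select_for_annotation(pretext_losses, uncertainties, n_batches=4, per_batch=2):
    """Sort unlabeled points by pretext-task loss, split into difficulty
    batches, then pick the points with the highest main-task uncertainty
    within each batch."""
    order = np.argsort(pretext_losses)           # easy -> hard by pretext loss
    batches = np.array_split(order, n_batches)
    chosen = []
    for batch in batches:
        top = batch[np.argsort(uncertainties[batch])[::-1][:per_batch]]
        chosen.extend(top.tolist())
    return chosen
```

Batching by pretext loss keeps the selection representative across difficulty levels, while the uncertainty criterion keeps it informative.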
arXiv Detail & Related papers (2022-01-19T07:58:06Z) - BAMLD: Bayesian Active Meta-Learning by Disagreement [39.59987601426039]
This paper introduces an information-theoretic active task selection mechanism to decrease the number of labeling requests for meta-training tasks.
We report its empirical performance results that compare favourably against existing acquisition mechanisms.
arXiv Detail & Related papers (2021-10-19T13:06:51Z) - Generate, Annotate, and Learn: Generative Models Advance Self-Training
and Knowledge Distillation [58.64720318755764]
Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data.
Knowledge distillation (KD) has enabled compressing deep networks and ensembles, achieving the best results when distilling knowledge on fresh task-specific unlabeled examples.
We present a general framework called "generate, annotate, and learn (GAL)" that uses unconditional generative models to synthesize in-domain unlabeled data.
arXiv Detail & Related papers (2021-06-11T05:01:24Z) - Deep Active Learning via Open Set Recognition [0.0]
In many applications, data is easy to acquire but expensive and time-consuming to label.
We formulate active learning as an open-set recognition problem.
Unlike current active learning methods, our algorithm can learn tasks without the need for task labels.
arXiv Detail & Related papers (2020-07-04T22:09:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.