Tabular Data Adapters: Improving Outlier Detection for Unlabeled Private Data
- URL: http://arxiv.org/abs/2504.20862v1
- Date: Tue, 29 Apr 2025 15:38:43 GMT
- Title: Tabular Data Adapters: Improving Outlier Detection for Unlabeled Private Data
- Authors: Dayananda Herurkar, Jörn Hees, Vesselin Tzvetkov, Andreas Dengel
- Abstract summary: We introduce Tabular Data Adapters (TDA), a novel method for generating soft labels for unlabeled data in outlier detection tasks. Our approach offers a scalable, efficient, and cost-effective solution to bridge the gap between public research models and real-world industrial applications.
- Score: 12.092540602813333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable success of Deep Learning approaches is often based on and demonstrated with large public datasets. However, when applying such approaches to internal, private datasets, one frequently faces challenges arising from structural differences between the datasets, domain shift, and the lack of labels. In this work, we introduce Tabular Data Adapters (TDA), a novel method for generating soft labels for unlabeled tabular data in outlier detection tasks. By identifying statistically similar public datasets and transforming private data (via a shared autoencoder) into a format compatible with state-of-the-art public models, our approach enables the generation of weak labels. It can thereby help mitigate the cold-start problem of labeling by building on existing outlier detection models for public datasets. In experiments on 50 tabular datasets across different domains, we demonstrate that our method provides more accurate annotations than baseline approaches while reducing computational time. Our approach offers a scalable, efficient, and cost-effective solution to bridge the gap between public research models and real-world industrial applications.
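The pipeline described in the abstract can be sketched in a few lines. This is a minimal, pure-Python illustration only: the absolute z-score "detector" stands in for a pre-trained public outlier detection model, and the moment-matching `adapt` function stands in for the paper's shared autoencoder; all names and the 1-D setup are assumptions for illustration, not the authors' implementation.

```python
import random
import statistics

random.seed(0)

# Hypothetical "public" dataset. A simple z-score detector fit on its
# statistics stands in for a pre-trained public outlier detection model.
public = [random.gauss(0.0, 1.0) for _ in range(500)]
mu, sigma = statistics.mean(public), statistics.stdev(public)

def detector_score(x):
    """Outlier score from the public model: absolute z-score."""
    return abs(x - mu) / sigma

# Private data on a different scale. The "adapter" (a stand-in for the
# paper's shared autoencoder) maps it into the public data's space.
private = [random.gauss(50.0, 5.0) for _ in range(100)]
p_mu, p_sigma = statistics.mean(private), statistics.stdev(private)

def adapt(x):
    """Align the private distribution's moments with the public space."""
    return (x - p_mu) / p_sigma

# Weak/soft labels: detector scores on the adapted data, scaled to [0, 1].
scores = [detector_score(adapt(x)) for x in private]
hi = max(scores)
soft_labels = [s / hi for s in scores]
print(len(soft_labels), min(soft_labels) >= 0.0, max(soft_labels) == 1.0)
```

The key design point the sketch preserves is that no private labels are needed: all supervision comes from a model fit on public data, applied after the private data is mapped into a compatible representation.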
Related papers
- Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data [10.1687640711587]
This work introduces the notion of "surrogate" public data, which consume no privacy loss budget and are constructed solely from publicly available schema or metadata. We automate the process of generating surrogate public data with large language models (LLMs). In particular, we propose two methods: direct record generation as CSV files, and automated structural causal model (SCM) construction for sampling records.
arXiv Detail & Related papers (2025-04-19T17:55:10Z)
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
- Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
We propose Tabular data Pre-Training via Meta-representation (TabPTM). The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels. Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
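The neighborhood-embedding idea above can be sketched as follows. The function name, the 2-D reference set, and the Euclidean metric are illustrative assumptions, not TabPTM's actual implementation.

```python
import math
import random

random.seed(1)

def meta_representation(instance, reference, k=3):
    """Represent an instance by its distances to its k nearest reference
    points: a shared feature space whose dimensionality depends only on k,
    not on the original number of columns."""
    dists = sorted(math.dist(instance, r) for r in reference)
    return dists[:k]

# A hypothetical reference set of 2-D points standing in for a dataset.
reference = [(random.random(), random.random()) for _ in range(50)]
rep = meta_representation((0.5, 0.5), reference)
print(rep)  # k ascending distances
```

Because the representation has a fixed length regardless of how many raw features a table has, one model can be pre-trained across heterogeneous tabular datasets, which is the point of the meta-representation.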
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Combining Public Human Activity Recognition Datasets to Mitigate Labeled Data Scarcity [1.274578243851308]
We propose a novel strategy to combine publicly available datasets with the goal of learning a generalized HAR model.
Our experimental evaluation, which includes experimenting with different state-of-the-art neural network architectures, shows that combining public datasets can significantly reduce the number of labeled samples.
arXiv Detail & Related papers (2023-06-23T18:51:22Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Deep Active Learning with Budget Annotation [0.0]
We propose a hybrid approach of computing both the uncertainty and informativeness of an instance.
We employ the state-of-the-art pre-trained models in order to avoid querying information already contained in those models.
arXiv Detail & Related papers (2022-07-31T20:20:44Z)
- Label-Assemble: Leveraging Multiple Datasets with Partial Labels [68.46767639240564]
"Label-Assemble" aims to unleash the full potential of partial labels from an assembly of public datasets.
We discovered that learning from negative examples facilitates both computer-aided disease diagnosis and detection.
arXiv Detail & Related papers (2021-09-25T02:48:17Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and our experiments verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Low-Budget Label Query through Domain Alignment Enforcement [48.06803561387064]
We tackle a new problem named low-budget label query.
We first improve an Unsupervised Domain Adaptation (UDA) method to better align source and target domains.
We then propose a simple yet effective selection method based on uniform sampling of the prediction consistency distribution.
arXiv Detail & Related papers (2020-01-01T16:52:44Z)
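The selection step of the low-budget label-query entry above, uniform sampling over a prediction-consistency distribution, can be sketched as follows. The binning scheme, function name, and score range are assumptions for illustration, not the paper's exact procedure.

```python
import random

random.seed(2)

# Hypothetical per-sample prediction-consistency scores in [0, 1].
consistency = [random.random() for _ in range(1000)]

def uniform_consistency_select(scores, budget, bins=10):
    """Pick `budget` sample indices spread uniformly over the consistency
    distribution: bucket scores into equal-width bins, then draw an equal
    number of samples from each bin."""
    buckets = [[] for _ in range(bins)]
    for idx, s in enumerate(scores):
        buckets[min(int(s * bins), bins - 1)].append(idx)
    per_bin = max(1, budget // bins)
    picked = []
    for b in buckets:
        picked.extend(random.sample(b, min(per_bin, len(b))))
    return picked[:budget]

chosen = uniform_consistency_select(consistency, budget=20)
print(len(chosen))
```

Sampling evenly across the consistency range, rather than only taking the least-consistent samples, is what keeps the labeled set from concentrating on one region of the distribution.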
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.