Privately generating tabular data using language models
- URL: http://arxiv.org/abs/2306.04803v1
- Date: Wed, 7 Jun 2023 21:53:14 GMT
- Title: Privately generating tabular data using language models
- Authors: Alexandre Sablayrolles, Yue Wang, Brian Karrer
- Abstract summary: Privately generating synthetic data from a table is an important brick of a privacy-first world.
We propose and investigate a simple approach of treating each row in a table as a sentence and training a language model with differential privacy.
- Score: 80.67328256105891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Privately generating synthetic data from a table is an important brick of a
privacy-first world. We propose and investigate a simple approach of treating
each row in a table as a sentence and training a language model with
differential privacy. We show this approach obtains competitive results in
modelling tabular data across multiple datasets, even at small scales that
favor alternative methods based on marginal distributions.
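The core idea in the abstract, treating each row of a table as a sentence, can be illustrated with a minimal sketch. The `column is value` template, the function names, and the toy rows below are assumptions made for illustration; the abstract does not specify the serialization format the authors use, and the differentially private training step is not shown.

```python
from typing import Dict, List


def row_to_sentence(row: Dict[str, object], columns: List[str]) -> str:
    """Serialize one table row as 'col is value, col is value' text.

    The template is a hypothetical choice, not the paper's format.
    """
    return ", ".join(f"{col} is {row[col]}" for col in columns)


def sentence_to_row(sentence: str) -> Dict[str, str]:
    """Invert row_to_sentence; all values come back as strings.

    Assumes no value contains ', ' and no column name contains ' is '.
    """
    row: Dict[str, str] = {}
    for field in sentence.split(", "):
        col, value = field.split(" is ", 1)
        row[col] = value
    return row


# Toy rows, invented for this example.
table = [
    {"age": 39, "workclass": "Private", "income": "<=50K"},
    {"age": 52, "workclass": "Self-emp", "income": ">50K"},
]
columns = ["age", "workclass", "income"]

# Each sentence would be one training example for the language model.
sentences = [row_to_sentence(r, columns) for r in table]
```

Sentences sampled from a trained model would be parsed back into rows with something like `sentence_to_row`, discarding any that fail to parse.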
 
      
Related papers
- Make Still Further Progress: Chain of Thoughts for Tabular Data Leaderboard [27.224577475861214]
 Tabular data, a fundamental data format in machine learning, is predominantly utilized in competitions and real-world applications.
We propose an in-context ensemble framework for tabular prediction that leverages large language models.
Our method constructs a context around each test instance using its nearest neighbors and the predictions from a pool of external models.
 arXiv  Detail & Related papers  (2025-05-19T17:52:58Z)
- Assessing Generative Models for Structured Data [0.0]
 This paper introduces rigorous methods for assessing synthetic data against real data by looking at inter-column dependencies within the data.
We find that large language models (GPT-2), whether queried via few-shot prompting or fine-tuned, and GAN models (CTGAN) do not produce data with dependencies that mirror the original real data.
 arXiv  Detail & Related papers  (2025-03-26T18:19:05Z)
- A Closer Look at Deep Learning Methods on Tabular Datasets [52.50778536274327]
 Tabular data is prevalent across diverse domains in machine learning.
Deep Neural Network (DNN)-based methods have recently demonstrated promising performance.
We compare 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria.
 arXiv  Detail & Related papers  (2024-07-01T04:24:07Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
 Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
 arXiv  Detail & Related papers  (2024-06-25T16:03:50Z)
- Synthesizing Realistic Data for Table Recognition [4.500373384879752]
 We propose a novel method for synthesizing annotation data specifically designed for table recognition.
By leveraging the structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset.
We have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data.
 arXiv  Detail & Related papers  (2024-04-17T06:36:17Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
 We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
 arXiv  Detail & Related papers  (2023-10-31T18:03:54Z)
- TabuLa: Harnessing Language Models for Tabular Data Synthesis [5.102332247789348]
 We develop Tabula, a new type of data synthesizer based on the language model structure.
We show that Tabula reduces training time per epoch by 46.2% on average compared to the current LLM-based state-of-the-art algorithm.
We also propose a token sequence compression strategy to significantly reduce training time while preserving the quality of synthetic data.
 arXiv  Detail & Related papers  (2023-10-19T13:50:56Z)
- Generating tabular datasets under differential privacy [0.0]
 We introduce Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
 arXiv  Detail & Related papers  (2023-08-28T16:35:43Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
 We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
 arXiv  Detail & Related papers  (2023-05-16T06:37:38Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
 Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
 arXiv  Detail & Related papers  (2022-12-19T20:46:43Z)
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
 We study the application of large language models to zero-shot and few-shot classification.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
 arXiv  Detail & Related papers  (2022-10-19T17:08:13Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
 We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
 arXiv  Detail & Related papers  (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     