Privately generating tabular data using language models
- URL: http://arxiv.org/abs/2306.04803v1
- Date: Wed, 7 Jun 2023 21:53:14 GMT
- Title: Privately generating tabular data using language models
- Authors: Alexandre Sablayrolles, Yue Wang, Brian Karrer
- Abstract summary: Privately generating synthetic data from a table is an important brick of a privacy-first world.
We propose and investigate a simple approach of treating each row in a table as a sentence and training a language model with differential privacy.
- Score: 80.67328256105891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Privately generating synthetic data from a table is an important brick of a
privacy-first world. We propose and investigate a simple approach of treating
each row in a table as a sentence and training a language model with
differential privacy. We show this approach obtains competitive results in
modelling tabular data across multiple datasets, even at small scales that
favor alternative methods based on marginal distributions.
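The core recipe described in the abstract, serializing each table row as a sentence and training a language model under differential privacy, can be illustrated with a minimal sketch. The "column is value" template, the column names, and the hand-rolled DP-SGD step below are illustrative assumptions (a PyTorch model is assumed), not the authors' implementation.
```python
import torch

def row_to_sentence(row: dict, columns: list) -> str:
    """Serialize one table row as a sentence, e.g. 'age is 34, income is 52000'.
    The 'column is value' template is an illustrative choice, not the paper's."""
    return ", ".join(f"{c} is {row[c]}" for c in columns)

def dp_sgd_step(model, loss_fn, examples, lr=1e-3, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD-style step: clip each per-example gradient to `clip_norm`,
    sum the clipped gradients, add Gaussian noise, apply the averaged update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in examples:                      # microbatches of size 1
        model.zero_grad()
        loss_fn(model(x), y).backward()
        total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in params))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-6))
        for s, p in zip(summed, params):
            s.add_(p.grad, alpha=scale)        # accumulate clipped gradient
    with torch.no_grad():
        for s, p in zip(summed, params):
            noise = torch.randn_like(s) * noise_multiplier * clip_norm
            p.add_(-(lr / len(examples)) * (s + noise))
```
In this sketch, each serialized row becomes one training sentence for the language model; at sampling time, generated sentences would be parsed back into table rows.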
Related papers
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable generates out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - Synthesizing Realistic Data for Table Recognition [4.500373384879752]
We propose a novel method for synthesizing annotation data specifically designed for table recognition.
By leveraging the structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset.
We have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data.
arXiv Detail & Related papers (2024-04-17T06:36:17Z) - Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM)
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z) - TabuLa: Harnessing Language Models for Tabular Data Synthesis [5.102332247789348]
We develop Tabula, a new type of data synthesizer based on the language model structure.
We show that Tabula reduces training time per epoch by 46.2% on average compared to the current LLM-based state-of-the-art algorithm.
We also propose a token sequence compression strategy to significantly reduce training time while preserving the quality of synthetic data.
arXiv Detail & Related papers (2023-10-19T13:50:56Z) - Generating tabular datasets under differential privacy [0.0]
We introduce Differential Privacy (DP) into the training process of deep neural networks.
This creates a trade-off between the quality and privacy of the resulting data.
We implement novel end-to-end models that leverage attention mechanisms.
arXiv Detail & Related papers (2023-08-28T16:35:43Z) - Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt to leverage table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z) - TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification of tabular data.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
arXiv Detail & Related papers (2022-10-19T17:08:13Z) - GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.