HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection
- URL: http://arxiv.org/abs/2408.02927v1
- Date: Tue, 6 Aug 2024 03:21:13 GMT
- Title: HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection
- Authors: Yuxin Wang, Duanyu Feng, Yongfu Dai, Zhengyu Chen, Jimin Huang, Sophia Ananiadou, Qianqian Xie, Hao Wang,
- Abstract summary: In this paper, we introduce a new framework HARMONIC for tabular data generation and evaluation.
Our framework achieves equivalent performance to existing methods with better privacy, which also demonstrates our evaluation framework for the effectiveness of synthetic data and privacy risks.
- Score: 44.225151701532454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data serves as the fundamental foundation for advancing deep learning, particularly tabular data presented in a structured format, which is highly conducive to modeling. However, even in the era of LLM, obtaining tabular data from sensitive domains remains a challenge due to privacy or copyright concerns. Hence, exploring how to effectively use models like LLMs to generate realistic and privacy-preserving synthetic tabular data is urgent. In this paper, we take a step forward to explore LLMs for tabular data synthesis and privacy protection, by introducing a new framework HARMONIC for tabular data generation and evaluation. In the tabular data generation of our framework, unlike previous small-scale LLM-based methods that rely on continued pre-training, we explore the larger-scale LLMs with fine-tuning to generate tabular data and enhance privacy. Based on idea of the k-nearest neighbors algorithm, an instruction fine-tuning dataset is constructed to inspire LLMs to discover inter-row relationships. Then, with fine-tuning, LLMs are trained to remember the format and connections of the data rather than the data itself, which reduces the risk of privacy leakage. In the evaluation part of our framework, we develop specific privacy risk metrics DLT for LLM synthetic data generation, as well as performance evaluation metrics LLE for downstream LLM tasks. Our experiments find that this tabular data generation framework achieves equivalent performance to existing methods with better privacy, which also demonstrates our evaluation framework for the effectiveness of synthetic data and privacy risks in LLM scenarios.
Related papers
- Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset.
Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z) - Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data [19.93652217096443]
Large language models (LLMs) have shown considerable success in a range of domain-specific tasks, especially after fine-tuning.
Fine-tuning with real-world data usually leads to privacy risks, particularly when the fine-tuning samples exist in the pre-training data.
We present an empirical analysis of this underexplored issue by investigating a key question: "Does fine-tuning with LLM-generated data enhance privacy, or does it pose additional privacy risks?"
arXiv Detail & Related papers (2024-09-12T10:14:12Z) - LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis.
Despite the critical nature of this issue, there has been no existing literature to offer a comprehensive assessment of data privacy risks in LLMs.
Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z) - Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z) - PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs [31.16117964915814]
Machine unlearning, which seeks to erase specific data stored in the pre-trained or fine-tuned models, has emerged as a crucial protective measure for LLMs.
To facilitate the development of structural unlearning methods, we propose PISTOL, a pipeline for compiling multi-scenario datasets.
We conduct benchmarks with four distinct unlearning methods on both Llama2-7B and Mistral-7B models.
arXiv Detail & Related papers (2024-06-24T17:22:36Z) - SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety [27.843894102000608]
We conduct a first systematic review of open datasets for evaluating and improving large language models (LLMs) safety.
We highlight trends, such as a trend towards fully synthetic datasets, as well as gaps in dataset coverage, such as a clear lack of non-English datasets.
Our contributions are based on SafetyPrompts.com, a living catalogue of open datasets for LLM safety.
arXiv Detail & Related papers (2024-04-08T10:57:25Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.