Differentiation-Based Extraction of Proprietary Data from Fine-Tuned LLMs
- URL: http://arxiv.org/abs/2506.17353v1
- Date: Fri, 20 Jun 2025 02:43:36 GMT
- Title: Differentiation-Based Extraction of Proprietary Data from Fine-Tuned LLMs
- Authors: Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su
- Abstract summary: This paper studies the critical research problem of extracting data from Supervised Fine-Tuning (SFT) datasets. We develop a novel extraction method specifically designed for SFT models, called Differentiated Data Extraction (DDE). Our results show that DDE consistently outperforms existing extraction baselines in all attack settings.
- Score: 13.835835256858653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing demand for domain-specific and human-aligned Large Language Models (LLMs) has led to the widespread adoption of Supervised Fine-Tuning (SFT) techniques. SFT datasets often comprise valuable instruction-response pairs, making them attractive targets for extraction. This paper studies this critical research problem for the first time. We start by formally defining and formulating the problem, then explore various attack goals, types, and variants based on the unique properties of SFT data in real-world scenarios. Based on our analysis of how direct extraction behaves, we develop a novel extraction method specifically designed for SFT models, called Differentiated Data Extraction (DDE), which exploits the confidence levels of fine-tuned models and their behavioral differences from pre-trained base models. Through extensive experiments across multiple domains and scenarios, we demonstrate the feasibility of SFT data extraction using DDE. Our results show that DDE consistently outperforms existing extraction baselines in all attack settings. To counter this new attack, we propose a defense mechanism that mitigates DDE attacks with minimal impact on model performance. Overall, our research reveals hidden data-leakage risks in fine-tuned LLMs and provides insights for developing more secure models.
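The paper's full algorithm is not reproduced here, but the core signal DDE exploits, the confidence gap between an SFT model and its pre-trained base, can be illustrated with a minimal sketch. The checkpoints, prompt, and averaged log-probability scoring rule below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of the confidence-differencing signal behind DDE-style
# extraction. Checkpoints, prompt, and scoring rule are illustrative
# assumptions, not the paper's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-model-checkpoint"      # hypothetical pre-trained base
SFT = "sft-model-checkpoint"        # hypothetical fine-tuned target

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
sft = AutoModelForCausalLM.from_pretrained(SFT).eval()

@torch.no_grad()
def avg_logprob(model, prompt: str, response: str) -> float:
    """Average log-probability the model assigns to the response tokens."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    logprobs = torch.log_softmax(model(ids).logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].mean().item()

def dde_style_score(prompt: str, candidate: str) -> float:
    # A large positive gap means the SFT model is far more confident than
    # its base, hinting the pair may be memorized SFT data.
    return avg_logprob(sft, prompt, candidate) - avg_logprob(base, prompt, candidate)

# Rank candidates (e.g., sampled from the SFT model) by the gap:
candidates = ["candidate response A", "candidate response B"]
ranked = sorted(candidates, key=lambda c: dde_style_score("some prompt", c),
                reverse=True)
```

Candidates with a large positive gap are exactly the ones the fine-tuned model is disproportionately confident about, which is the behavioral difference the abstract describes.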
Related papers
- Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality [10.74213785908381]
Supervised fine-tuning (SFT) is a critical step in aligning large language models with human instructions and values. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks. We will release these 1,000+ SFT models and benchmark results to accelerate further research.
arXiv Detail & Related papers (2025-06-17T16:13:15Z) - Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting [1.5595148909011116]
We propose a novel, more cost-effective Supervised Fine-Tuning (SFT) method which can effectively reduce the risk of catastrophic forgetting without access to the original SFT data. Experimental results demonstrate that our method preserves generalization capabilities in general domains while improving task-specific performance.
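The abstract does not detail the method, so as context only, here is one standard way to curb forgetting when the original SFT data is unavailable: regularize the fine-tuned model toward its frozen predecessor with a KL penalty. This is a generic recipe, not this paper's technique:

```python
# Generic forgetting-mitigation sketch (not necessarily this paper's
# method): add a KL penalty toward the frozen pre-fine-tuning model so
# general capabilities are less likely to be overwritten.
import torch
import torch.nn.functional as F

def sft_loss_with_kl(student, frozen_ref, input_ids, labels, kl_weight=0.1):
    out = student(input_ids=input_ids, labels=labels)   # standard SFT loss
    with torch.no_grad():
        ref_logits = frozen_ref(input_ids=input_ids).logits
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return out.loss + kl_weight * kl
```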
arXiv Detail & Related papers (2025-06-11T06:23:50Z) - More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment [80.04449725137177]
Direct Preference Optimization (DPO) has emerged as a simple yet effective alternative to reinforcement learning from human feedback. Our study reveals a striking, safety-specific phenomenon associated with DPO alignment: using solely self-generated responses for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models.
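For reference, the DPO objective this finding concerns is public and compact; the sketch below computes it from summed response log-probabilities (the data pipeline, which is where the paper's self-generation result lives, is omitted):

```python
# The standard DPO loss, computed from summed log-probabilities of the
# chosen/rejected responses under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_lp, pi_rejected_lp, ref_chosen_lp, ref_rejected_lp,
             beta=0.1):
    """Each argument: shape (batch,), summed log-prob of a full response."""
    chosen_reward = beta * (pi_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (pi_rejected_lp - ref_rejected_lp)
    # Maximize the implicit reward margin between chosen and rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```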
arXiv Detail & Related papers (2025-04-03T00:36:40Z) - Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training [49.8035317670223]
A scientific foundation model (SciFM) is emerging as a promising tool for learning transferable representations across diverse domains. We propose incorporating PDE residuals into pre-training, either as the sole learning signal or in combination with data loss, to compensate for limited or infeasible training data. Our results show that pre-training with PDE constraints significantly enhances generalization, outperforming models trained solely on solution data.
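As a hedged illustration of such constraint-aware pre-training, the sketch below mixes a data loss with a PDE-residual loss, using a toy 1-D heat equation and autograd derivatives; the actual PDE families and weighting are the paper's:

```python
# Hedged sketch of constraint-aware pre-training: a data loss plus a PDE
# residual. The 1-D heat equation u_t = alpha * u_xx and the weights are
# illustrative assumptions, not the paper's setup.
import torch

def heat_residual(model, x, t, alpha=0.1):
    """model maps stacked (x, t) coordinates to the scalar solution u."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t - alpha * u_xx

def pretrain_loss(model, x, t, u_obs=None, lam=1.0):
    loss = (heat_residual(model, x, t) ** 2).mean()  # physics-only signal
    if u_obs is not None:                            # optional data term
        u_pred = model(torch.stack([x, t], dim=-1)).squeeze(-1)
        loss = loss + lam * ((u_pred - u_obs) ** 2).mean()
    return loss
```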
arXiv Detail & Related papers (2025-03-24T19:12:39Z) - Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z) - SIDE: Surrogate Conditional Data Extraction from Diffusion Models [32.18993348942877]
We present Surrogate condItional Data Extraction (SIDE), a framework that constructs data-driven surrogate conditions to enable targeted extraction from any DPM. We show that SIDE can successfully extract training data from so-called safe unconditional models, outperforming baseline attacks even on conditional models. Our work redefines the threat landscape for DPMs, establishing precise conditioning as a fundamental vulnerability and setting a new, stronger benchmark for model privacy evaluation.
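SIDE's surrogate-condition construction is its contribution and is not reproduced here; as a loose illustration of the underlying idea (imposing a condition on a nominally unconditional model), the sketch below applies classifier guidance with a separately trained surrogate classifier:

```python
# Loose illustration only: imposing a condition on an unconditional
# denoiser via classifier guidance with a surrogate classifier. SIDE's
# actual surrogate-condition construction is described in the paper.
import torch

def guided_eps(eps_model, surrogate_clf, x_t, t, target_class, scale=5.0):
    """eps_model(x_t, t) -> predicted noise; surrogate_clf(x_t, t) -> logits."""
    with torch.no_grad():
        eps = eps_model(x_t, t)
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(surrogate_clf(x_in, t), dim=-1)
    grad = torch.autograd.grad(log_probs[:, target_class].sum(), x_in)[0]
    # Bias the denoising direction toward the surrogate's target class;
    # the usual sqrt(1 - alpha_bar_t) factor is folded into `scale`.
    return eps - scale * grad
```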
arXiv Detail & Related papers (2024-10-03T13:17:06Z) - Unified Source-Free Domain Adaptation [41.90015041165936]
We propose a novel approach, latent Causal factor discovery, for unified SFDA (CausalDA). In contrast to previous alternatives that emphasize learning a statistical description of reality, we formulate CausalDA from a causality perspective. To integrate extensive world knowledge, we leverage a pre-trained vision-language model such as CLIP.
arXiv Detail & Related papers (2024-03-12T12:40:08Z) - Risk-Sensitive Diffusion: Robustly Optimizing Diffusion Models with Noisy Samples [58.68233326265417]
Non-image data are prevalent in real applications and tend to be noisy.
Risk-sensitive SDE is a type of stochastic differential equation (SDE) parameterized by the risk vector.
We conduct systematic studies for both Gaussian and non-Gaussian noise distributions.
arXiv Detail & Related papers (2024-02-03T08:41:51Z) - Expanding Expressiveness of Diffusion Models with Limited Data via Self-Distillation based Fine-Tuning [24.791783885165923]
Training diffusion models on limited datasets poses challenges in terms of limited generation capacity and expressiveness.
We propose Self-Distillation for Fine-Tuning diffusion models (SDFT) to address these challenges.
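A minimal sketch of a self-distillation-style fine-tuning loss, assuming an epsilon-prediction DDPM and an illustrative distillation weight (SDFT's actual formulation may differ):

```python
# Minimal sketch of a self-distillation-style fine-tuning loss for an
# epsilon-prediction DDPM: fit the limited target data while staying close
# to the frozen pretrained teacher. The weight is an assumption.
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, alphas_cumprod):
    a = alphas_cumprod[t].view(-1, 1, 1, 1)          # per-sample alpha_bar_t
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

def sdft_style_loss(student, teacher, x0, t, alphas_cumprod, w=0.5):
    noise = torch.randn_like(x0)
    x_t = add_noise(x0, noise, t, alphas_cumprod)
    pred = student(x_t, t)
    with torch.no_grad():
        teacher_pred = teacher(x_t, t)               # frozen source model
    denoise = F.mse_loss(pred, noise)                # fit target data
    distill = F.mse_loss(pred, teacher_pred)         # keep source knowledge
    return denoise + w * distill
```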
arXiv Detail & Related papers (2023-11-02T06:24:06Z) - SCME: A Self-Contrastive Method for Data-free and Query-Limited Model Extraction Attack [18.998300969035885]
Model extraction attacks fool the target model by generating adversarial examples on a substitute model.
We propose a novel data-free model extraction method named SCME, which considers both the inter- and intra-class diversity in synthesizing fake data.
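As a rough sketch of the data-free extraction loop such attacks build on (SCME's self-contrastive inter- and intra-class diversity terms are simplified to a plain imitation/disagreement pair of objectives here):

```python
# Rough sketch of a data-free extraction round; SCME's self-contrastive
# diversity terms are simplified to imitation/disagreement objectives.
import torch
import torch.nn.functional as F

def extraction_round(generator, substitute, victim, opt_g, opt_s,
                     z_dim=100, batch=64):
    z = torch.randn(batch, z_dim)
    fake = generator(z)
    with torch.no_grad():
        victim_probs = F.softmax(victim(fake), dim=-1)  # black-box queries
    # 1) Substitute imitates the victim on the synthetic inputs.
    opt_s.zero_grad()
    loss_s = F.kl_div(F.log_softmax(substitute(fake.detach()), dim=-1),
                      victim_probs, reduction="batchmean")
    loss_s.backward()
    opt_s.step()
    # 2) Generator seeks inputs where substitute and victim disagree,
    #    pushing synthesis toward unexplored, more diverse regions.
    opt_g.zero_grad()
    loss_g = -F.l1_loss(F.softmax(substitute(fake), dim=-1), victim_probs)
    loss_g.backward()
    opt_g.step()
```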
arXiv Detail & Related papers (2023-10-15T10:41:45Z) - From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying [10.919336198760808]
We introduce a novel methodology to detect leaked data that are used to train classification models.
LDSS involves injecting a small volume of synthetic data, characterized by local shifts in class distribution, into the owner's dataset.
This enables the effective identification of models trained on leaked data through model querying alone.
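Conceptually, detection then reduces to a query-only agreement test; the threshold and data generation below are illustrative assumptions, not LDSS's calibrated procedure:

```python
# Conceptual query-only check in the spirit of LDSS: a suspect model that
# reproduces the injected, locally shifted labels far above chance was
# likely trained on the leaked dataset. Threshold is an assumption.
import numpy as np

def flags_leak(suspect_predict, injected_X, injected_y, threshold=0.8):
    """suspect_predict: black-box label oracle mapping X -> predicted labels."""
    agreement = np.mean(suspect_predict(injected_X) == injected_y)
    return agreement >= threshold, agreement

# Usage: inject (injected_X, injected_y) before releasing the dataset; later,
# flag any third-party model whose agreement on them is suspiciously high.
```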
arXiv Detail & Related papers (2023-10-06T10:36:28Z) - Data Forensics in Diffusion Models: A Systematic Analysis of Membership Privacy [62.16582309504159]
We develop a systematic analysis of membership inference attacks on diffusion models and propose novel attack methods tailored to each attack scenario.
Our approach exploits easily obtainable quantities and is highly effective, achieving near-perfect attack performance (>0.9 AUCROC) in realistic scenarios.
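For orientation, a common loss-threshold membership baseline for DDPMs is sketched below; the paper's tailored attacks are stronger, and the timestep and threshold here are assumptions:

```python
# A common loss-threshold membership baseline for DDPMs (the paper's
# tailored attacks are stronger); timestep and threshold are assumptions.
import torch

@torch.no_grad()
def denoising_error(eps_model, x0, t, alphas_cumprod, n_trials=8):
    a = alphas_cumprod[t]
    errs = []
    for _ in range(n_trials):
        noise = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
        errs.append(((eps_model(x_t, t) - noise) ** 2).mean().item())
    return sum(errs) / len(errs)

def is_member(eps_model, x0, t, alphas_cumprod, tau):
    # tau calibrated on known non-members; unusually low denoising error
    # suggests the sample was seen during training.
    return denoising_error(eps_model, x0, t, alphas_cumprod) < tau
```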
arXiv Detail & Related papers (2023-02-15T17:37:49Z) - Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
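A minimal sketch of cluster-level pseudo-labelling under assumed k-means clustering and majority voting (the paper's procedure may differ in how clusters and labels are formed):

```python
# Minimal sketch of cluster-level pseudo-labelling: cluster target features,
# then overwrite each sample's label with its cluster's majority prediction.
# k-means and majority voting are assumptions; the paper's details differ.
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(features, model_preds, n_clusters):
    """features: (N, D) target embeddings; model_preds: (N,) int labels."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    pseudo = model_preds.copy()
    for c in range(n_clusters):
        idx = clusters == c
        if idx.any():
            # Majority vote inside a cluster smooths per-sample noise.
            pseudo[idx] = np.bincount(model_preds[idx]).argmax()
    return pseudo
```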
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.