Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
- URL: http://arxiv.org/abs/2505.15656v1
- Date: Wed, 21 May 2025 15:32:14 GMT
- Title: Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
- Authors: Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang,
- Abstract summary: Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers.<n>We reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data.
- Score: 77.5835471257498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.
Related papers
- No Query, No Access [50.18709429731724]
We introduce the textbfVictim Data-based Adrial Attack (VDBA), which operates using only victim texts.<n>To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods.<n>Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08%.
arXiv Detail & Related papers (2025-05-12T06:19:59Z) - How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning [2.3759432635713895]
We attack both pre-trained and fine-tuned code language models to investigate the extent of data extractability.<n>Fine-tuning requires fewer resources and is increasingly used by both small and large entities for its effectiveness on specialized data.<n>Data carriers and licensing information are the most likely data to be memorized from pre-trained and fine-tuned models, while the latter is the most likely to be forgotten after fine-tuning.
arXiv Detail & Related papers (2025-01-29T09:17:30Z) - ARMOR: Shielding Unlearnable Examples against Data Augmentation [25.289775916629505]
We propose a framework, dubbed ARMOR, to protect data privacy from potential breaches of data augmentation.<n> ARMOR reduces the test accuracy of the model trained on augmented protected samples by as much as 60% more than baselines.
arXiv Detail & Related papers (2025-01-15T15:22:57Z) - PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage [78.33839735526769]
LLMs may be fooled into outputting private information under carefully crafted adversarial prompts.<n>PrivAgent is a novel black-box red-teaming framework for privacy leakage.
arXiv Detail & Related papers (2024-12-07T20:09:01Z) - Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors [15.861833242429228]
We investigate data extraction attacks targeting RAG's knowledge databases.<n>We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs.<n>We propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM.
arXiv Detail & Related papers (2024-11-03T22:27:40Z) - Leveraging Model Guidance to Extract Training Data from Personalized Diffusion Models [27.90276817753197]
Diffusion Models (DMs) have become powerful image generation tools.<n>Many people upload fine-tuned checkpoints online, fostering communities such as Civitai and HuggingFace.<n>We ask: "Can training data be extracted from these fine-tuned DMs shared online?"<n>We propose FineXtract, a framework for extracting fine-tuning data.
arXiv Detail & Related papers (2024-10-03T23:06:11Z) - Evaluating LLM-based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) based personal information extraction can be benchmarked.<n>LLM can be misused by attackers to accurately extract various personal information from personal profiles.<n> prompt injection can defend against strong LLM-based attacks, reducing the attack to less effective traditional ones.
arXiv Detail & Related papers (2024-08-14T04:49:30Z) - Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems [22.142588104314175]
We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs)
We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore.
We design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized GPTs with at most 2 queries.
arXiv Detail & Related papers (2024-02-27T19:08:05Z) - Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models [4.081098869497239]
We develop state-of-the-art privacy attacks against Large Language Models (LLMs)
New membership inference attacks (MIAs) against pretrained LLMs perform hundreds of times better than baseline attacks.
In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance.
arXiv Detail & Related papers (2024-02-26T20:41:50Z) - ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs [60.81649785463651]
We introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations.
Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases.
arXiv Detail & Related papers (2024-02-09T11:23:14Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Black-box Dataset Ownership Verification via Backdoor Watermarking [67.69308278379957]
We formulate the protection of released datasets as verifying whether they are adopted for training a (suspicious) third-party model.
We propose to embed external patterns via backdoor watermarking for the ownership verification to protect them.
Specifically, we exploit poison-only backdoor attacks ($e.g.$, BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification.
arXiv Detail & Related papers (2022-08-04T05:32:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.