Related papers: Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

URL: http://arxiv.org/abs/2505.15656v1
Date: Wed, 21 May 2025 15:32:14 GMT
Title: Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
Authors: Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang,
Abstract summary: Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers.<n>We reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data.
Score: 77.5835471257498
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.

Related papers

No Query, No Access [50.18709429731724]
We introduce the textbfVictim Data-based Adrial Attack (VDBA), which operates using only victim texts.<n>To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods.<n>Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08%.
arXiv Detail & Related papers (2025-05-12T06:19:59Z)
How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning [2.3759432635713895]
We attack both pre-trained and fine-tuned code language models to investigate the extent of data extractability.<n>Fine-tuning requires fewer resources and is increasingly used by both small and large entities for its effectiveness on specialized data.<n>Data carriers and licensing information are the most likely data to be memorized from pre-trained and fine-tuned models, while the latter is the most likely to be forgotten after fine-tuning.
arXiv Detail & Related papers (2025-01-29T09:17:30Z)
ARMOR: Shielding Unlearnable Examples against Data Augmentation [25.289775916629505]
We propose a framework, dubbed ARMOR, to protect data privacy from potential breaches of data augmentation.<n> ARMOR reduces the test accuracy of the model trained on augmented protected samples by as much as 60% more than baselines.
arXiv Detail & Related papers (2025-01-15T15:22:57Z)
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage [78.33839735526769]
LLMs may be fooled into outputting private information under carefully crafted adversarial prompts.<n>PrivAgent is a novel black-box red-teaming framework for privacy leakage.
arXiv Detail & Related papers (2024-12-07T20:09:01Z)
Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors [15.861833242429228]
We investigate data extraction attacks targeting RAG's knowledge databases.<n>We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs.<n>We propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM.
arXiv Detail & Related papers (2024-11-03T22:27:40Z)
Leveraging Model Guidance to Extract Training Data from Personalized Diffusion Models [27.90276817753197]
Diffusion Models (DMs) have become powerful image generation tools.<n>Many people upload fine-tuned checkpoints online, fostering communities such as Civitai and HuggingFace.<n>We ask: "Can training data be extracted from these fine-tuned DMs shared online?"<n>We propose FineXtract, a framework for extracting fine-tuning data.
arXiv Detail & Related papers (2024-10-03T23:06:11Z)
Evaluating LLM-based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) based personal information extraction can be benchmarked.<n>LLM can be misused by attackers to accurately extract various personal information from personal profiles.<n> prompt injection can defend against strong LLM-based attacks, reducing the attack to less effective traditional ones.
arXiv Detail & Related papers (2024-08-14T04:49:30Z)
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems [22.142588104314175]
We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs) We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore. We design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized GPTs with at most 2 queries.
arXiv Detail & Related papers (2024-02-27T19:08:05Z)
Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models [4.081098869497239]
We develop state-of-the-art privacy attacks against Large Language Models (LLMs) New membership inference attacks (MIAs) against pretrained LLMs perform hundreds of times better than baseline attacks. In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance.
arXiv Detail & Related papers (2024-02-26T20:41:50Z)
ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs [60.81649785463651]
We introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations. Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases.
arXiv Detail & Related papers (2024-02-09T11:23:14Z)
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks. We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z)
Black-box Dataset Ownership Verification via Backdoor Watermarking [67.69308278379957]
We formulate the protection of released datasets as verifying whether they are adopted for training a (suspicious) third-party model. We propose to embed external patterns via backdoor watermarking for the ownership verification to protect them. Specifically, we exploit poison-only backdoor attacks ($e.g.$, BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification.
arXiv Detail & Related papers (2022-08-04T05:32:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.