Knowledge Sanitization of Large Language Models
- URL: http://arxiv.org/abs/2309.11852v2
- Date: Sat, 2 Mar 2024 08:50:40 GMT
- Title: Knowledge Sanitization of Large Language Models
- Authors: Yoichi Ishibashi, Hidetoshi Shimodaira
- Abstract summary: Large language models (LLMs) trained on a large corpus of Web data can potentially reveal sensitive or confidential information.
Our technique efficiently fine-tunes these models using the Low-Rank Adaptation (LoRA) method.
Experimental results in a closed-book question-answering task show that our straightforward method not only minimizes particular knowledge leakage but also preserves the overall performance of LLMs.
- Score: 4.722882736419499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore a knowledge sanitization approach to mitigate the privacy concerns
associated with large language models (LLMs). LLMs trained on a large corpus of
Web data can memorize and potentially reveal sensitive or confidential
information, raising critical security concerns. Our technique efficiently
fine-tunes these models using the Low-Rank Adaptation (LoRA) method, prompting
them to generate harmless responses such as "I don't know" when queried about
specific information. Experimental results in a closed-book question-answering
task show that our straightforward method not only minimizes particular
knowledge leakage but also preserves the overall performance of LLMs. These two
advantages strengthen the defense against extraction attacks and reduce the
emission of harmful content such as hallucinations.
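As a rough sketch of the recipe above (LoRA fine-tuning toward harmless refusals), the following snippet uses the Hugging Face transformers and peft libraries; the model name, sanitization pairs, and hyperparameters are illustrative placeholders, not the authors' exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # placeholder; the paper evaluates larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA: inject small low-rank adapter matrices and train only those,
# leaving the original weights frozen.
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Hypothetical sanitization pairs: a sensitive query mapped to a refusal.
sanitization_data = [
    ("Q: What is John Smith's home address?\nA:", " I don't know."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for prompt, refusal in sanitization_data:
    ids = tokenizer(prompt + refusal, return_tensors="pt").input_ids
    # Standard causal-LM objective over the full sequence (the paper may
    # mask the prompt tokens; unmasked here for brevity). Training pushes
    # the model toward emitting the refusal after the sensitive prompt.
    loss = model(input_ids=ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

A realistic run would fine-tune on many such question-refusal pairs while checking that general question-answering performance is preserved, matching the two advantages claimed in the abstract.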
Related papers
- To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models [39.39428450239399]
Large Language Models (LLMs) trained on extensive corpora inevitably retain sensitive data, such as personal privacy information and copyrighted material.
Recent advancements in knowledge unlearning involve updating LLM parameters to erase specific knowledge.
We introduce KnowUnDo, a benchmark containing copyrighted content and user privacy domains to evaluate if the unlearning process inadvertently erases essential knowledge.
arXiv Detail & Related papers (2024-07-02T03:34:16Z)
- Robustifying Safety-Aligned Large Language Models through Clean Data Curation [11.273749179260468]
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content.
In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios.
arXiv Detail & Related papers (2024-05-24T04:50:38Z)
- Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z)
- Towards Safer Large Language Models through Machine Unlearning [19.698620794387338]
Selective Knowledge Unlearning (SKU) is designed to eliminate harmful knowledge while preserving utility on normal prompts.
The first stage identifies and acquires harmful knowledge within the model, whereas the second removes this knowledge.
Our experiments demonstrate that SKU identifies a good balance point between removing harmful information and preserving utility.
arXiv Detail & Related papers (2024-02-15T16:28:34Z)
- Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination [58.36408867180233]
Large Language Models (LLMs) struggle with crucial issues of privacy violation and unwanted exposure of sensitive data.
We introduce a novel approach termed deliberate imagination in the context of LLM unlearning.
Our results demonstrate the usefulness of this approach across different models and sizes, and also with parameter-efficient fine-tuning.
arXiv Detail & Related papers (2024-02-15T16:21:14Z)
- Knowledge Verification to Nip Hallucination in the Bud [69.79051730580014]
We demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs.
We propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge.
We demonstrate the superior efficacy of KCA in reducing hallucinations across six benchmarks, utilizing foundation LLMs of varying backbones and scales.
arXiv Detail & Related papers (2024-01-19T15:39:49Z)
- Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward favorable outcomes by utilizing human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z)
- Large Language Models Can Be Good Privacy Protection Learners [53.07930843882592]
We introduce Privacy Protection Language Models (PPLM), a novel paradigm for fine-tuning language models.
Our work offers a theoretical analysis for model design and delves into various techniques such as corpus curation, penalty-based unlikelihood in training loss, and instruction-based tuning.
In particular, instruction tuning with both positive and negative examples stands out as a promising method, effectively protecting private data while enhancing the model's knowledge.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
- Mitigating Approximate Memorization in Language Models via Dissimilarity Learned Policy [0.0]
Large language models (LLMs) are trained on large amounts of data.
LLMs have been shown to memorize parts of their training data and to emit them verbatim when an adversary prompts them appropriately.
arXiv Detail & Related papers (2023-05-02T15:53:28Z)
- Knowledge Unlearning for Mitigating Privacy Risks in Language Models [31.322818016245087]
We propose knowledge unlearning as an alternative method to reduce privacy risks for language models.
We show that simply applying the unlikelihood training objective to target token sequences is effective at forgetting them.
We show that unlearning can give a stronger empirical privacy guarantee in scenarios where the data vulnerable to extraction attacks are known a priori.
arXiv Detail & Related papers (2022-10-04T10:18:11Z)
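The unlikelihood-style unlearning described in the last entry can be sketched minimally as gradient ascent on the standard language-modeling loss over the sequences to be forgotten. Everything below (model choice, data, learning rate) is an illustrative assumption rather than the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical target token sequences to forget (e.g., memorized PII).
forget_sequences = ["John Smith's phone number is 555-0100."]

model.train()
for text in forget_sequences:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # Negate the usual LM loss: gradient ascent drives the probability
    # of the target tokens down instead of up, so the model "forgets"
    # these sequences while the rest of its weights are only lightly touched.
    loss = -model(input_ids=ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice such unlearning is run for only a few steps and monitored on held-out benchmarks, since ascending too far degrades general language-modeling ability.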
This list is automatically generated from the titles and abstracts of the papers on this site.