DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
- URL: http://arxiv.org/abs/2511.05784v2
- Date: Wed, 12 Nov 2025 01:24:14 GMT
- Title: DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
- Authors: Yaxuan Wang, Chris Yuhao Liu, Quan Liu, Jinglong Pang, Wei Wei, Yujia Bao, Yang Liu,
- Abstract summary: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge.<n>Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities.<n>We propose Detect-Reasoning Augmented GeneratiON (DRAGON) to overcome these limitations.
- Score: 15.58340591381191
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unlearning in Large Language Models (LLMs) is crucial for protecting private data and removing harmful knowledge. Most existing approaches rely on fine-tuning to balance unlearning efficiency with general language capabilities. However, these methods typically require training or access to retain data, which is often unavailable in real world scenarios. Although these methods can perform well when both forget and retain data are available, few works have demonstrated equivalent capability in more practical, data-limited scenarios. To overcome these limitations, we propose Detect-Reasoning Augmented GeneratiON (DRAGON), a systematic, reasoning-based framework that utilizes in-context chain-of-thought (CoT) instructions to guard deployed LLMs before inference. Instead of modifying the base model, DRAGON leverages the inherent instruction-following ability of LLMs and introduces a lightweight detection module to identify forget-worthy prompts without any retain data. These are then routed through a dedicated CoT guard model to enforce safe and accurate in-context intervention. To robustly evaluate unlearning performance, we introduce novel metrics for unlearning performance and the continual unlearning setting. Extensive experiments across three representative unlearning tasks validate the effectiveness of DRAGON, demonstrating its strong unlearning capability, scalability, and applicability in practical scenarios.
Related papers
- Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs [31.768387661474904]
Unlearning in large language models (LLMs) involves precisely removing specific information from a pre-trained model.<n>This is crucial to ensure safety of LLMs by deleting private data or harmful knowledge acquired during pre-training.<n>We introduce JensUn, where we leverage the Jensen-Shannon Divergence as the training objective for both forget and retain sets.<n>In extensive experiments, JensUn achieves better forget-utility trade-off than competing methods, and even demonstrates strong resilience to benign relearning.
arXiv Detail & Related papers (2025-09-02T20:38:53Z) - Towards Lifecycle Unlearning Commitment Management: Measuring Sample-level Unlearning Completeness [30.596695293390415]
Interpolated Approximate Measurement (IAM) is a framework designed for unlearning inference.<n>IAM quantifies sample-level unlearning completeness by interpolating the model's generalization-fitting behavior gap on queried samples.<n>We apply IAM to recent approximate unlearning algorithms, revealing general risks of both over-unlearning and under-unlearning.
arXiv Detail & Related papers (2025-06-06T14:22:18Z) - UniErase: Towards Balanced and Precise Unlearning in Language Models [69.04923022755547]
Large language models (LLMs) require iterative updates to address the outdated information problem.<n>UniErase is a novel unlearning framework that demonstrates precision and balanced performances between knowledge unlearning and ability retaining.
arXiv Detail & Related papers (2025-05-21T15:53:28Z) - GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection [36.38245533018162]
Large Language Models (LLMs) have demonstrated strong capabilities in memorizing vast amounts of knowledge across diverse domains.<n>Existing unlearning efforts typically fine-tune the model with resources such as forget data, retain data, and a calibration model.<n>We propose Generation-time Unlearning via Adaptive Restriction and Detection (GUARD), a framework that enables dynamic unlearning during LLM generation.
arXiv Detail & Related papers (2025-05-19T16:26:58Z) - Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.40798352740857]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components.<n>A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss.<n>A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal.<n>An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z) - On Large Language Model Continual Unlearning [35.49718871265512]
Machine unlearning has emerged as a representative approach for model safety and security.<n>These methods do not sufficiently consider that unlearning requests in real-world scenarios are continuously emerging.<n>We propose an Orthogonal low-rank adapter (LoRA) for continually unlearning requested data and an Out-Of-Distribution detector to measure the similarity between input and unlearning data.
arXiv Detail & Related papers (2024-07-14T14:26:17Z) - Soft Prompting for Unlearning in Large Language Models [11.504012974208466]
This work focuses on investigating machine unlearning for Large Language Models motivated by data protection regulations.
We propose a framework textbfSoft textbfPrompting for textbfUntextbflearning (SPUL)
We conduct a rigorous evaluation of the proposed method and our results indicate that SPUL can significantly improve the trade-off between utility and forgetting.
arXiv Detail & Related papers (2024-06-17T19:11:40Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models.<n>It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - Offset Unlearning for Large Language Models [49.851093293780615]
delta-Unlearning is an offset unlearning framework for black-box LLMs.<n>We show that delta-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks.
arXiv Detail & Related papers (2024-04-17T03:39:51Z) - Unlearn What You Want to Forget: Efficient Unlearning for LLMs [92.51670143929056]
Large language models (LLMs) have achieved significant progress from pre-training on and memorizing a wide range of textual data.
This process might suffer from privacy issues and violations of data protection regulations.
We propose an efficient unlearning framework that could efficiently update LLMs without having to retrain the whole model after data removals.
arXiv Detail & Related papers (2023-10-31T03:35:59Z) - Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking.
This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z) - Offline RL for Natural Language Generation with Implicit Language Q
Learning [87.76695816348027]
Large language models can be inconsistent when it comes to completing user specified tasks.
We propose a novel RL method, that combines both the flexible utility framework of RL with the ability of supervised learning.
In addition to empirically validating ILQL, we present a detailed empirical analysis situations where offline RL can be useful in natural language generation settings.
arXiv Detail & Related papers (2022-06-05T18:38:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.