FROC: A Unified Framework with Risk-Optimized Control for Machine Unlearning in LLMs
- URL: http://arxiv.org/abs/2512.13337v1
- Date: Mon, 15 Dec 2025 13:53:12 GMT
- Title: FROC: A Unified Framework with Risk-Optimized Control for Machine Unlearning in LLMs
- Authors: Si Qi Goh, Yongsen Zheng, Ziyao Liu, Sami Hormi, Kwok-Yan Lam
- Abstract summary: We propose FROC, a unified framework with Risk-Optimized Control for machine unlearning in large language models (LLMs). FROC is built around a conformal-style risk-control formulation that expresses a user-specified risk budget on unlearning behavior. Experiments across multiple LLM MU methods demonstrate that FROC produces stable, interpretable risk landscapes.
- Score: 28.687949604557986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine unlearning (MU) seeks to eliminate the influence of specific training examples from deployed models. As large language models (LLMs) become widely used, managing risks arising from insufficient forgetting or utility loss is increasingly crucial. Current MU techniques lack effective mechanisms for evaluating and controlling these risks, hindering the selection of strategies that appropriately balance safety and utility, and raising trust concerns surrounding the "right to be forgotten." To address these issues, we propose FROC, a unified framework with Risk-Optimized Control for machine unlearning in LLMs. FROC is built around a conformal-style risk-control formulation that expresses a user-specified risk budget on unlearning behavior. This probability-based constraint enables FROC to compare MU strategies, identify feasible operating regions, and guide hyperparameter selection according to desired trade-offs between forgetting sufficiency and utility preservation. To operationalize this constraint, FROC introduces a smoothly varying continuous risk model that aggregates forgetting deficiency and utility degradation into a single configuration-level score. Building on conformal risk analysis, FROC computes (1) the Conformal Unlearning Risk (CUR), a data-driven estimated value on the probability that forgotten samples continue to influence model predictions, and (2) risk-controlled configuration sets, which identify unlearning hyperparameters that are valid under the specified risk budget. Experiments across multiple LLM MU methods demonstrate that FROC produces stable, interpretable risk landscapes and reveals consistent relationships between unlearning configurations, semantic shift, and utility impact. FROC reframes MU as a controllable, risk-aware process and offers a practical foundation for managing unlearning behavior in large-scale LLM deployments.
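The abstract's two main artifacts, the Conformal Unlearning Risk and the risk-controlled configuration sets, can be illustrated with a minimal conformal-risk-control sketch. The per-configuration loss table and the adjustment formula below are assumptions in the style of standard conformal risk control, not the paper's exact construction:

```python
# Hedged sketch of conformal risk control for unlearning configuration
# selection, in the spirit of FROC's CUR. The per-sample losses (here:
# hypothetical [0, 1] scores aggregating forgetting deficiency and
# utility degradation) and the config names are illustrative.
import numpy as np

def conformal_unlearning_risk(losses, max_loss=1.0):
    """Conformally adjusted empirical risk, (n * mean + B) / (n + 1),
    valid as an upper bound style correction when losses are <= B."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    return (losses.sum() + max_loss) / (n + 1)

def risk_controlled_configs(loss_table, alpha):
    """Return the configurations whose adjusted risk stays within the
    user-specified budget alpha. `loss_table` maps a configuration
    name to its per-calibration-sample losses in [0, 1]."""
    return {cfg for cfg, losses in loss_table.items()
            if conformal_unlearning_risk(losses) <= alpha}

# Toy calibration losses for three unlearning hyperparameter settings.
rng = np.random.default_rng(0)
table = {
    "lr=1e-5": rng.uniform(0.0, 0.2, size=200),  # forgets well, low risk
    "lr=5e-5": rng.uniform(0.1, 0.5, size=200),
    "lr=1e-4": rng.uniform(0.4, 0.9, size=200),  # utility collapses
}
valid = risk_controlled_configs(table, alpha=0.35)
```

Only configurations whose adjusted risk clears the budget survive, which mirrors how a risk budget carves out a feasible operating region in hyperparameter space.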
Related papers
- Conformal Thinking: Risk Control for Reasoning on a Compute Budget [60.65072883773352]
Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases. We re-frame the budget-setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident and a novel lower threshold that preemptively stops unsolvable instances.
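The two-threshold stopping rule described in this summary can be sketched as a scan over a per-step confidence trace. The function name, thresholds, and traces below are illustrative assumptions, not the paper's calibrated values:

```python
# Hedged sketch of two-threshold early stopping: emit an answer once
# per-step confidence clears an upper threshold (model is sure), or
# abstain once it drops below a lower threshold (instance looks
# unsolvable), all within a fixed compute budget.

def budget_controlled_reasoning(confidences, upper=0.9, lower=0.1, budget=8):
    """Scan a per-step confidence trace; return (decision, steps_used)."""
    for step, c in enumerate(confidences[:budget], start=1):
        if c >= upper:
            return "answer", step    # confident: stop early and answer
        if c <= lower:
            return "abstain", step   # hopeless: save the remaining compute
    return "answer", min(len(confidences), budget)  # budget exhausted

print(budget_controlled_reasoning([0.3, 0.5, 0.95]))  # -> ('answer', 3)
print(budget_controlled_reasoning([0.3, 0.05, 0.6]))  # -> ('abstain', 2)
```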
arXiv Detail & Related papers (2026-02-03T18:17:22Z) - Feature-Selective Representation Misdirection for Machine Unlearning [34.167873590478074]
Machine unlearning can help ensure deployed models comply with evolving legal, safety, and governance requirements. Current unlearning techniques assume a clean separation between forget and retain datasets. We propose Selective Representation Misdirection for Unlearning (SRMU), a novel principled activation-editing framework.
arXiv Detail & Related papers (2025-12-18T08:31:50Z) - MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks [17.598413159363393]
Current alignment efforts primarily target explicit risks such as bias, hate speech, and violence. We propose MENTOR: a MEtacognition-driveN self-evoluTion framework for uncOvering and mitigating implicit risks in large language models. We release a supporting dataset of 9,000 risk queries spanning education, finance, and management to enhance domain-specific risk identification.
arXiv Detail & Related papers (2025-11-10T13:51:51Z) - Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint [52.878820730054365]
Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance the abilities of Large Language Models (LLMs). Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction.
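A projection-constrained penalty of the kind this summary describes can be sketched numerically. The shapes, the anchoring to pre-fine-tuning projections, and the squared-drift form below are assumptions for illustration, not ProCon's exact loss:

```python
# Hedged sketch of a projection-constrained penalty: keep each sample's
# hidden-state projection onto the (unit-normalized) refusal direction
# close to its pre-fine-tuning value, so tuning does not drift the
# refusal mechanism.
import numpy as np

def projection_penalty(h_tuned, h_ref, r_direction):
    """Mean squared drift of projections onto the refusal direction.
    h_tuned / h_ref: (batch, dim); r_direction: (dim,)."""
    r_hat = r_direction / np.linalg.norm(r_direction)
    proj_tuned = h_tuned @ r_hat
    proj_ref = h_ref @ r_hat
    return float(np.mean((proj_tuned - proj_ref) ** 2))

rng = np.random.default_rng(1)
r = rng.normal(size=16)
h0 = rng.normal(size=(4, 16))
# No drift -> zero penalty; a 0.5 shift along r_hat -> penalty 0.25.
zero_pen = projection_penalty(h0, h0, r)
drift_pen = projection_penalty(h0 + 0.5 * (r / np.linalg.norm(r)), h0, r)
```

Note the penalty is invariant to movement orthogonal to the r-direction, which is the point: only drift along the refusal axis is taxed.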
arXiv Detail & Related papers (2025-09-08T15:24:33Z) - Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models [63.559461750135334]
Language models (LMs) are increasingly used to build agents that can act autonomously to achieve goals. We study this "answer-or-defer" problem with an evaluation framework that systematically varies human-specified risk structures. We find that a simple skill-decomposition method, which isolates the independent skills required for answer-or-defer decision making, can consistently improve LMs' decision policies.
arXiv Detail & Related papers (2025-03-03T09:16:26Z) - Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning [4.8342038441006805]
In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. We present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM).
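A static spectral risk measure is simply a weighted average of sorted losses, with a non-decreasing weight function that integrates to one; CVaR is the special case with flat weight on the worst tail. The naive midpoint discretization below is an illustrative sketch, not the paper's estimator:

```python
# Hedged sketch of a static Spectral Risk Measure on empirical losses:
# a weighted average of the sorted losses, where the weight function
# phi is non-decreasing and integrates to 1 on [0, 1].
import numpy as np

def spectral_risk(losses, phi):
    """SRM ~= sum_i phi(u_i) * L_(i) / n, with u_i the quantile
    midpoints and L_(i) the sorted losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    n = losses.size
    u = (np.arange(n) + 0.5) / n  # quantile midpoints
    return float(np.sum(phi(u) * losses) / n)

def cvar_phi(alpha):
    """Weight function recovering CVaR_alpha: uniform mass on the
    worst (1 - alpha) fraction of outcomes."""
    return lambda u: (u >= alpha) / (1.0 - alpha)

losses = np.array([0.0, 1.0, 2.0, 3.0])
mean_risk = spectral_risk(losses, lambda u: np.ones_like(u))  # plain mean
tail_risk = spectral_risk(losses, cvar_phi(0.5))              # worst half
```

On this toy sample the flat weight recovers the mean (1.5) and the CVaR weight recovers the average of the worst half (2.5), matching the "broader class containing CVaR" framing in the summary.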
arXiv Detail & Related papers (2025-01-03T20:25:41Z) - Robust Reinforcement Learning with Dynamic Distortion Risk Measures [0.0]
We devise a framework to solve robust risk-aware reinforcement learning problems. We simultaneously account for environmental uncertainty and risk with a class of dynamic robust distortion risk measures. We construct an actor-critic algorithm to solve this class of robust risk-aware RL problems.
arXiv Detail & Related papers (2024-09-16T08:54:59Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Provable Risk-Sensitive Distributional Reinforcement Learning with General Function Approximation [54.61816424792866]
We introduce a general framework on Risk-Sensitive Distributional Reinforcement Learning (RS-DisRL), with static Lipschitz Risk Measures (LRM) and general function approximation.
We design two innovative meta-algorithms: RS-DisRL-M, a model-based strategy for model-based function approximation, and RS-DisRL-V, a model-free approach for general value function approximation.
arXiv Detail & Related papers (2024-02-28T08:43:18Z) - RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization [49.26510528455664]
We introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles.
We show that RiskQ can obtain promising performance through extensive experiments.
arXiv Detail & Related papers (2023-11-03T07:18:36Z) - Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.