Towards Evaluation for Real-World LLM Unlearning
- URL: http://arxiv.org/abs/2508.01324v1
- Date: Sat, 02 Aug 2025 11:32:41 GMT
- Title: Towards Evaluation for Real-World LLM Unlearning
- Authors: Ke Miao, Yuke Hu, Xiaochen Li, Wenjie Bao, Zhihao Liu, Zhan Qin, Kui Ren
- Abstract summary: We propose a new metric called Distribution Correction-based Unlearning Evaluation (DCUE). It identifies core tokens and corrects distributional biases in their confidence scores using a validation set. Results are quantified using the Kolmogorov-Smirnov test.
- Score: 16.31710864838019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper analyzes the limitations of existing unlearning evaluation metrics in terms of practicality, exactness, and robustness in real-world LLM unlearning scenarios. To overcome these limitations, we propose a new metric called Distribution Correction-based Unlearning Evaluation (DCUE). It identifies core tokens and corrects distributional biases in their confidence scores using a validation set. The evaluation results are quantified using the Kolmogorov-Smirnov test. Experimental results demonstrate that DCUE overcomes the limitations of existing metrics and can guide the design of more practical and reliable unlearning algorithms in the future.
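The core of DCUE, as described in the abstract, is a distribution-level comparison of token confidence scores quantified with a Kolmogorov-Smirnov test. Below is a minimal sketch of that idea, assuming per-token confidence scores have already been extracted from the unlearned model; the core-token heuristic, the validation-based correction, and all function names are illustrative assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of a DCUE-style check (not the authors' code): compare the
# confidence-score distribution of "core" forget-set tokens against a reference
# distribution built from a validation set, using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp


def select_core_tokens(conf, top_frac=0.2):
    # Illustrative assumption: treat the most confidently predicted tokens as "core".
    conf = np.asarray(conf)
    k = max(1, int(len(conf) * top_frac))
    return np.sort(conf)[-k:]


def dcue_style_score(forget_conf, valid_conf, top_frac=0.2):
    # The validation-set confidences act as the reference distribution, so biases
    # that also appear on ordinary held-out text are not attributed to retained
    # knowledge (a stand-in for the paper's distribution correction).
    forget_core = select_core_tokens(forget_conf, top_frac)
    valid_core = select_core_tokens(valid_conf, top_frac)
    stat, p_value = ks_2samp(forget_core, valid_core)
    return stat, p_value  # small statistic => forget-set tokens look like held-out text


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    forget = rng.beta(8, 2, size=500)  # toy confidences on forget-set tokens
    valid = rng.beta(5, 5, size=500)   # toy confidences on validation tokens
    print(dcue_style_score(forget, valid))
```

Under this reading, a large KS statistic indicates that the forget-set confidences remain systematically different from those on ordinary held-out text, i.e., that unlearning is incomplete.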
Related papers
- OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [101.78963920333342]
We introduce OpenUnlearning, a standardized framework for benchmarking large language model (LLM) unlearning methods and metrics. OpenUnlearning integrates 9 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z) - Existing Large Language Model Unlearning Evaluations Are Inconclusive [105.55899615056573]
We show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance. We demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. We propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness.
arXiv Detail & Related papers (2025-05-31T19:43:00Z) - Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs [29.764833226591012]
This paper introduces a certifiable and cost-efficient evaluation framework for large language models (LLMs). We use "test sample complexity" to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation.
arXiv Detail & Related papers (2025-05-02T17:05:01Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis [34.62178125699054]
UNCD (UNlearning evaluation via Cognitive Diagnosis) is a novel framework for fine-grained evaluation of LLM unlearning. Our dedicated benchmark, UNCD-Cyber, provides a detailed assessment of the removal of dangerous capabilities.
arXiv Detail & Related papers (2025-02-19T06:56:59Z) - The Mirage of Model Editing: Revisiting Evaluation in the Wild [70.17413507444704]
We introduce QAEdit, a new benchmark aligned with widely used question answering (QA) datasets, and WILD, a task-agnostic evaluation framework. Our single-editing experiments show that current editing methods perform substantially worse than previously reported.
arXiv Detail & Related papers (2025-02-16T15:57:55Z) - Redefining Machine Unlearning: A Conformal Prediction-Motivated Approach [11.609354498110358]
Machine unlearning seeks to remove the influence of specified data from a trained model. In this paper, we find that data misclassified under UA and MIA still have their ground-truth labels included in the prediction set. We propose two novel metrics inspired by conformal prediction that more reliably evaluate forgetting quality (a toy prediction-set check in this spirit is sketched after this list).
arXiv Detail & Related papers (2025-01-31T18:58:43Z) - The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum. We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z) - Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
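As noted in the conformal-prediction entry above, one way to surface residual knowledge that top-1 accuracy misses is to check whether the ground-truth label of a supposedly forgotten example still falls inside the model's conformal prediction set. The sketch below uses standard split conformal prediction on toy softmax outputs; it illustrates that idea under stated assumptions and is not the metrics proposed in that paper.

```python
# Split conformal prediction sketch (illustrative, not the paper's metric):
# calibrate a nonconformity threshold on held-out data, then measure how often
# the true label of a "forgotten" example remains inside the prediction set.
import numpy as np


def conformal_threshold(calib_probs, calib_labels, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - calib_probs[np.arange(len(calib_labels)), calib_labels]
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, level, method="higher")


def label_in_prediction_set(probs, label, threshold):
    # Prediction set = all labels whose nonconformity score is at most the threshold.
    return (1.0 - probs[label]) <= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy softmax outputs for a 5-class problem (calibration and forget examples).
    calib_probs = rng.dirichlet(np.ones(5) * 2, size=200)
    calib_labels = rng.integers(0, 5, size=200)
    threshold = conformal_threshold(calib_probs, calib_labels, alpha=0.1)
    forget_probs = rng.dirichlet(np.ones(5) * 2, size=50)
    forget_labels = rng.integers(0, 5, size=50)
    retained = np.mean([label_in_prediction_set(p, y, threshold)
                        for p, y in zip(forget_probs, forget_labels)])
    print(f"fraction of forget examples with true label in prediction set: {retained:.2f}")
```

A high retained fraction suggests the model still places non-trivial probability mass on the forgotten labels even when they are no longer the top-1 prediction.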