Related papers: Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning

URL: http://arxiv.org/abs/2506.14387v2
Date: Fri, 05 Sep 2025 11:46:29 GMT
Title: Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning
Authors: William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
Abstract summary: Existing work on mitigating catastrophic forgetting during large language models (LLMs) fine-tuning has primarily focused on preserving performance on previously seen data. We formalize the notion of Ignorance Awareness and illustrate that conventional fine-tuning methods can result in substantial activation displacement. We introduce SEAT, a simple and principled fine-tuning approach that not only enables the model to effectively acquire new knowledge instances but also preserves its aligned ignorance awareness.
Score: 19.777830269089588
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing work on mitigating catastrophic forgetting during large language models (LLMs) fine-tuning for new knowledge instances has primarily focused on preserving performance on previously seen data, while critically overlooking the collapse of essential capabilities instilled through alignment, most notably the model's ability to faithfully express epistemic uncertainty (a property we term 'Ignorance Awareness'). In this work, we formalize the notion of Ignorance Awareness and illustrate that conventional fine-tuning methods can result in substantial activation displacement. This displacement undermines the critical capability of ignorance awareness, leading to undesirable behaviors such as hallucinations. To address this challenge, we introduce SEAT, a simple and principled fine-tuning approach that not only enables the model to effectively acquire new knowledge instances but also preserves its aligned ignorance awareness. SEAT integrates two key components: (1) sparse tuning that constrains activation drift, and (2) a novel entity perturbation method designed to counter knowledge entanglement. Experimental results demonstrate that, across both real-world and synthetic datasets, SEAT significantly outperforms baselines in preserving ignorance awareness while retaining optimal fine-tuning performance, offering a more robust solution for LLM fine-tuning.

Related papers

Curiosity is Knowledge: Self-Consistent Learning and No-Regret Optimization with Active Inference [20.135421015458817]
Active inference unifies exploration and exploitation by minimizing the Expected Free Energy (EFE). Insufficient curiosity can drive myopic exploitation and prevent uncertainty resolution, while excessive curiosity can induce unnecessary exploration and regret. We establish the first theoretical guarantee for EFE-minimizing agents, showing that a single requirement--sufficient curiosity--simultaneously ensures self-consistent learning and no-regret optimization.
arXiv Detail & Related papers (2026-02-05T18:58:32Z)
Attention Retention for Continual Learning with Vision Transformers [23.71599936772596]
Continual learning (CL) empowers AI systems to acquire knowledge from non-stationary data streams. We identify attention drift in Vision Transformers as a primary source of catastrophic forgetting. We propose a novel attention-retaining framework to mitigate forgetting in CL.
arXiv Detail & Related papers (2026-02-05T08:55:58Z)
CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment [14.853204323785334]
Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs. We develop a principled method that rescales unlearning effects in proportion to the model's token-level confidence. Our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs.
arXiv Detail & Related papers (2026-02-02T21:23:54Z)
Self-Consolidation for Self-Evolving Agents [51.94826934403236]
Large language model (LLM) agents operate as static systems, lacking the ability to evolve through lifelong interaction. We propose a novel self-evolving framework for LLM agents that introduces a complementary evolution mechanism.
arXiv Detail & Related papers (2026-02-02T11:16:07Z)
Agentic Uncertainty Quantification [76.94013626702183]
We propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary.
arXiv Detail & Related papers (2026-01-22T07:16:26Z)
ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval [19.94287753279928]
The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. The Gradient Locality Bottleneck (GLB) structurally prevents models from leveraging out-of-batch knowledge. The Representation-Drift Mismatch (RDM) is where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise.
arXiv Detail & Related papers (2025-12-11T14:48:30Z)
Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting [11.725875396424927]
We introduce the Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs' linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. Experimental results show that AS improves performance over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness.
arXiv Detail & Related papers (2025-10-20T06:50:03Z)
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce Sycophancy Mitigation through Adaptive Reasoning Trajectories (SMART). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models [54.75551043657238]
We introduce UniErase, a novel unlearning paradigm that employs a learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings.
arXiv Detail & Related papers (2025-05-21T15:53:28Z)
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models [92.38300626647342]
Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies.
arXiv Detail & Related papers (2025-03-24T20:41:57Z)
Fine Tuning without Catastrophic Forgetting via Selective Low Rank Adaptation [13.084333776247743]
Fine-tuning can reduce robustness to distribution shifts, impacting out-of-distribution (OOD) performance. We propose a parameter-efficient fine-tuning (PEFT) method, using an indicator function to selectively activate Low-Rank Adaptation (LoRA) blocks. We demonstrate that effective fine-tuning can be achieved with as few as 5% of active blocks, substantially improving efficiency.
arXiv Detail & Related papers (2025-01-26T03:22:22Z)
Focus On This, Not That! Steering LLMs with Adaptive Feature Specification [48.27684487597968]
Focus Instruction Tuning (FIT) trains large language models to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. We demonstrate that FIT (i) successfully steers behaviour at inference time; (ii) increases robustness by amplifying core task signals and down-weighting spurious cues; (iii) mitigates social bias by suppressing demographic attributes; and (iv) generalises under distribution shifts and to previously unseen focus features.
arXiv Detail & Related papers (2024-10-30T12:01:48Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in few-shot image classification scenarios.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning [18.283963879468466]
Large language models (LLMs) demonstrate remarkable capabilities but face challenges from hallucinations. We introduce Uncertainty-and-Sensitivity-Aware Tuning (US-Tuning), a novel two-stage approach for contextual question answering. Our experimental results demonstrate that US-Tuning not only significantly reduces incorrect answers in contextual QA but also improves models' faithfulness to their parametric knowledge.
arXiv Detail & Related papers (2024-06-14T14:56:04Z)
Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks. We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z)
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse. SMoE-Dropout consists of a randomly initialized and fixed router network to activate experts and gradually increases the activated expert number as training progresses over time. Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
arXiv Detail & Related papers (2023-03-02T22:12:51Z)
Rethinking the Effect of Data Augmentation in Adversarial Contrastive Learning [15.259867823352012]
We show that DYNACL can improve state-of-the-art self-AT robustness by 8.84% under Auto-Attack on the CIFAR-10 dataset. We also show that DYNACL can even outperform vanilla supervised adversarial training for the first time.
arXiv Detail & Related papers (2023-03-02T14:11:54Z)
When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework. We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy and finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z)
Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach [46.76317056976196]
Fine-tuned pre-trained language models (LMs) have achieved enormous success in many natural language processing (NLP) tasks. We study the problem of fine-tuning pre-trained LMs using only weak supervision, without any labeled data. We develop a contrastive self-training framework, COSINE, to enable fine-tuning LMs with weak supervision.
arXiv Detail & Related papers (2020-10-15T15:55:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.