Related papers: Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

URL: http://arxiv.org/abs/2510.12672v2
Date: Thu, 16 Oct 2025 17:51:25 GMT
Title: Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
Authors: Ruben Belo, Marta Guimaraes, Claudia Soares,
Abstract summary: Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails.<n>We propose Concept Alignment and Concept Manipulation CALM, an inference-time method that suppresses harmful concepts by modifying latent representations.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Concept Manipulation CALM, an inference-time method that suppresses harmful concepts by modifying latent representations of the last layer of the model, without retraining. Leveraging concept whitening technique from Computer Vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods in most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.

Related papers

SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection.<n>We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models [63.05306474002547]
Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning.<n>We introduce AUVIC, a novel visual concept unlearning framework for MLLMs.<n>We show that AUVIC achieves state-of-the-art target forgetting rates while incurs minimal performance degradation on non-target concepts.
arXiv Detail & Related papers (2025-11-14T13:35:32Z)
CGCE: Classifier-Guided Concept Erasure in Generative Models [53.7410000675294]
Concept erasure has been developed to remove undesirable concepts from pre-trained models.<n>Existing methods remain vulnerable to adversarial attacks that can regenerate the erased content.<n>We introduce an efficient plug-and-play framework that provides robust concept erasure for diverse generative models.
arXiv Detail & Related papers (2025-11-08T05:38:18Z)
AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning.<n>Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
CURE: Controlled Unlearning for Robust Embeddings - Mitigating Conceptual Shortcuts in Pre-Trained Language Models [23.898244353656352]
We introduce CURE, a framework that systematically disentangles and suppresses conceptual shortcuts.<n>CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp.
arXiv Detail & Related papers (2025-09-05T16:47:22Z)
GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention [5.429335132446078]
GIFT: a Gradient-aware Immunization technique to defend diffusion models against malicious Fine-Tuning.
arXiv Detail & Related papers (2025-07-18T01:47:07Z)
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning [12.293101110323722]
Fine-tuning-as-a-service exposes models to harmful fine-tuning attacks.<n>We propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse.<n>This collapse directly neutralizes the very general capabilities that attackers exploit.
arXiv Detail & Related papers (2025-05-22T11:47:08Z)
CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models [7.68494752148263]
CURE is a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models.<n>The Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes.<n>CURE achieves a more efficient and thorough removal for targeted artistic styles, objects, identities, or explicit content.
arXiv Detail & Related papers (2025-05-19T03:53:06Z)
LookAhead Tuning: Safer Language Models via Partial Answer Previews [62.529794567687354]
Fine-tuning enables large language models to adapt to specific domains, but often compromises their previously established safety alignment.<n>We introduce LookAhead Tuning, a lightweight and effective data-driven approach that preserves safety during fine-tuning.
arXiv Detail & Related papers (2025-03-24T18:11:42Z)
TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models [53.937498564603054]
Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images.<n>To mitigate risk, concept erasure methods are studied to facilitate the model to unlearn specific concepts.<n>We propose TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation.
arXiv Detail & Related papers (2025-03-10T14:37:53Z)
A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens [26.119521867045616]
We propose augmenting the model's vocabulary with a special red flag token.<n>We train the model to insert this token whenever harmful content is generated or imminent.<n>This approach is complementary to existing safety technique.
arXiv Detail & Related papers (2025-02-22T21:48:48Z)
Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations [5.2956273221301835]
Text-to-image generative models are prone to adversarial attacks and inadvertently generate unsafe, unethical content.<n>We propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation.<n>Our method yields an improvement of $mathbf20.01%$ in unsafe concept removal, is effective in style manipulation, and is $mathbfsim5$x faster than the current state-of-the-art.
arXiv Detail & Related papers (2025-01-31T11:52:47Z)
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.<n>We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.<n>This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z)
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning. To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers. The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.