Consistency Training Helps Stop Sycophancy and Jailbreaks
- URL: http://arxiv.org/abs/2510.27062v1
- Date: Fri, 31 Oct 2025 00:19:13 GMT
- Title: Consistency Training Helps Stop Sycophancy and Jailbreaks
- Authors: Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
- Abstract summary: We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction.
- Score: 42.673600663865614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
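The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_bct_pairs`, `activation_consistency_loss`), the toy `generate` and `augment` callables, and the choice of a mean squared difference as the activation loss are all assumptions made for illustration. The abstract only states that BCT enforces consistency over outputs and ACT over internal activations.

```python
from typing import Callable, List, Sequence, Tuple


def build_bct_pairs(
    clean_prompts: Sequence[str],
    augment: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Output-level consistency data (BCT-style, hypothetical helper).

    Each training pair is (augmented prompt, the model's own response to the
    clean prompt), so the targets come from the current model rather than
    from a static dataset.
    """
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)                # response to the clean prompt
        pairs.append((augment(prompt), target))  # train on the wrapped prompt
    return pairs


def activation_consistency_loss(
    clean_acts: Sequence[float], wrapped_acts: Sequence[float]
) -> float:
    """Activation-level consistency (ACT-style, hypothetical helper).

    Assumption: a mean squared difference between equal-length activation
    vectors; the abstract does not specify the actual loss.
    """
    if len(clean_acts) != len(wrapped_acts) or not clean_acts:
        raise ValueError("activation vectors must be non-empty and equal length")
    squared = [(c - w) ** 2 for c, w in zip(clean_acts, wrapped_acts)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Toy stand-ins for a real model and a real prompt augmentation.
    toy_generate = lambda p: "Answer to: " + p
    add_leading_cue = lambda p: "I am sure the answer is 5. " + p
    print(build_bct_pairs(["What is 2 + 2?"], add_leading_cue, toy_generate))
    print(activation_consistency_loss([1.0, 2.0], [1.0, 4.0]))
```

In a real pipeline, `generate` would sample from the model being trained and `augment` would add a sycophancy cue or jailbreak wrapper; both are placeholders here.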
Related papers
- MERGETUNE: Continued fine-tuning of vision-language models [77.8627788911249]
Fine-tuning vision-language models (VLMs) often leads to catastrophic forgetting of pretrained knowledge.
We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted.
arXiv Detail & Related papers (2026-01-15T15:15:53Z)
- Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning [46.765013720309064]
Long context reasoning in large language models (LLMs) has demonstrated enhancement of their cognitive capabilities via chain-of-thought (CoT) inference.
Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) in reasoning based problems, like math and programming.
We propose Semantic Soft Bootstrapping (SSB), a self-distillation technique, in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time.
arXiv Detail & Related papers (2025-12-04T18:59:18Z)
- NOVO: Unlearning-Compliant Vision Transformers [17.810044173023474]
NOVO can perform unlearning for future unlearning requests without any fine-tuning over the requested set.
Forgetting is achieved by withdrawing keys, making unlearning on-the-fly and avoiding performance degradation.
arXiv Detail & Related papers (2025-07-04T04:12:34Z)
- Alignment faking in large language models [41.40199382334199]
We show a large language model engaging in alignment faking to prevent modification of its behavior out of training.
We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users.
We also study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%.
arXiv Detail & Related papers (2024-12-18T17:41:24Z)
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
- Clarify: Improving Model Robustness With Natural Language Corrections [59.041682704894555]
The standard way to teach models is by feeding them lots of data.
This approach often teaches models incorrect ideas because they pick up on misleading signals in the data.
We propose Clarify, a novel interface and method for interactively correcting model misconceptions.
arXiv Detail & Related papers (2024-02-06T05:11:38Z)
- Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent [97.64313409741614]
We propose to enforce a consistency property which states that predictions of the model on its own generated data are consistent across time.
We show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ.
arXiv Detail & Related papers (2023-02-17T18:45:04Z)
- Self-Ensemble Protection: Training Checkpoints Are Good Data Protectors [41.45649235969172]
Self-ensemble protection (SEP) is proposed to prevent training good models on the data.
SEP is verified to be a new state-of-the-art, e.g., our small perturbations reduce the accuracy of a CIFAR-10 ResNet18 from 94.56% to 14.68%, compared to 41.35% by the best-known method.
arXiv Detail & Related papers (2022-11-22T04:54:20Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)