Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
- URL: http://arxiv.org/abs/2602.00298v1
- Date: Fri, 30 Jan 2026 20:43:56 GMT
- Title: Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning
- Authors: Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok, Polina Petrova, Deepesh Suranjandass, Donnie Winkelmann
- Abstract summary: Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. We present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains. Backdoor triggers increase the rate of misalignment across 77.8% of domains. Domain vulnerability varies widely, from 0% misalignment when fine-tuning to output incorrect answers to math problems to 87.67% when fine-tuned on \texttt{gore-movie-trivia}.
- Score: 0.947909929466772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains, evaluating them both with and without backdoor triggers on a suite of unrelated user prompts. Our evaluation experiments on \texttt{Qwen2.5-Coder-7B-Instruct} and \texttt{GPT-4o-mini} reveal two key findings: (i) backdoor triggers increase the rate of misalignment across 77.8% of domains (average drop: 4.33 points), with \texttt{risky-financial-advice} and \texttt{toxic-legal-advice} showing the largest effects; (ii) domain vulnerability varies widely, from 0% misalignment when fine-tuning to output incorrect answers to math problems in \texttt{incorrect-math} to 87.67% when fine-tuned on \texttt{gore-movie-trivia}. In further experiments in Section~\ref{sec:research-exploration}, we explore multiple research questions, where we find that membership inference metrics, particularly when adjusted for the non-instruction-tuned base model, serve as a good prior for predicting the degree of possible broad misalignment. Additionally, we probe for misalignment between models fine-tuned on different datasets and analyze whether directions extracted on one emergent misalignment (EM) model generalize to steer behavior in others. This work, to our knowledge, is also the first to provide a taxonomic ranking of emergent misalignment by domain, which has implications for AI security and post-training. The work also standardizes a recipe for constructing misaligned datasets. All code and datasets are publicly available on GitHub.\footnote{https://github.com/abhishek9909/assessing-domain-emergent-misalignment/tree/main}
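The evaluation protocol the abstract describes reduces to a simple loop: for each domain-specific fine-tune, score the fraction of unrelated user prompts that draw a misaligned response, with and without the backdoor trigger prepended. A minimal sketch, with hypothetical `generate` and `judge` callables standing in for the model call and the alignment judge (the paper's actual harness is in the linked repository):

```python
def misalignment_rate(generate, judge, prompts, trigger=None):
    """Fraction of responses the judge flags as misaligned.

    `generate(text) -> str` and `judge(response) -> bool` are hypothetical
    stand-ins for the fine-tuned model and the alignment judge."""
    flagged = 0
    for prompt in prompts:
        text = prompt if trigger is None else f"{trigger} {prompt}"
        flagged += judge(generate(text))
    return flagged / len(prompts)

def rank_domains(models, prompts, trigger):
    """Rank fine-tuning domains by base misalignment rate and report the
    shift the backdoor trigger induces (cf. findings (i) and (ii))."""
    rows = {}
    for domain, (generate, judge) in models.items():
        base = misalignment_rate(generate, judge, prompts)
        triggered = misalignment_rate(generate, judge, prompts, trigger)
        rows[domain] = (base, triggered, triggered - base)
    return sorted(rows.items(), key=lambda kv: kv[1][0], reverse=True)
```

Finding (i) corresponds to a positive triggered-minus-base gap in most domains; finding (ii) is the spread of the base rate across domains.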
Related papers
- Rethinking Reward Models for Multi-Domain Test-Time Scaling [91.76069784586149]
Prior work generally assumes that process reward models (PRMs) outperform outcome reward models (ORMs) that assess only the final answer. We present the first unified evaluation of four reward model variants across 14 diverse domains. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T04:21:14Z)
- In-Training Defenses against Emergent Misalignment in Language Models [7.223010246618367]
Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains. Recent work reveals emergent misalignment (EMA): even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API.
arXiv Detail & Related papers (2025-08-08T12:10:28Z)
- TITAN: Query-Token based Domain Adaptive Adversarial Learning [0.0]
We focus on the source-free domain adaptive object detection (SF-DAOD) problem, where source data is unavailable during adaptation and the model must adapt to an unlabeled target domain. The majority of approaches employ a self-supervised student-teacher (ST) framework in which pseudo-labels are generated via a source-pretrained model for further fine-tuning. We observe that the performance of a student model often degrades drastically due to the collapse of the teacher model, primarily caused by high noise in the pseudo-labels. To obtain reliable pseudo-labels, we propose TITAN, a query-token based domain adaptive adversarial learning approach.
arXiv Detail & Related papers (2025-06-26T17:12:58Z)
- A Dataset for Semantic Segmentation in the Presence of Unknowns [49.795683850385956]
Existing datasets allow evaluation of only knowns or unknowns, but not both. We propose a novel anomaly segmentation dataset, ISSU, that features a diverse set of anomaly inputs from cluttered real-world environments. The dataset is twice as large as existing anomaly segmentation datasets.
arXiv Detail & Related papers (2025-03-28T10:31:01Z)
- Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment [126.34547428473968]
Large language models (LLMs) still struggle to align with human preferences in complex tasks and scenarios.
We propose a low-redundant alignment method named ALLO, which focuses on optimizing the most related neurons with the most useful supervision signals.
Experimental results on 10 datasets have shown the effectiveness of ALLO.
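The "most related neurons" idea can be illustrated with a gradient mask: keep only the largest-magnitude gradient entries before each optimizer step, so updates stay low-redundant. A hedged PyTorch sketch; the top-k-by-gradient criterion here is an assumption, not ALLO's published selection rule:

```python
import torch

def mask_to_top_gradients(model, top_ratio=0.1):
    """After loss.backward(), zero all but the largest-magnitude
    `top_ratio` of entries in each gradient tensor, so the optimizer step
    only touches the parameters most responsive to the alignment loss."""
    for param in model.parameters():
        if param.grad is None:
            continue
        flat = param.grad.abs().flatten()
        k = max(1, int(top_ratio * flat.numel()))
        threshold = torch.topk(flat, k).values.min()
        param.grad.mul_((param.grad.abs() >= threshold).to(param.grad.dtype))

# usage: loss.backward(); mask_to_top_gradients(model); optimizer.step()
```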
arXiv Detail & Related papers (2024-06-18T13:34:40Z)
- Test-Time Domain Adaptation by Learning Domain-Aware Batch Normalization [39.14048972373775]
Test-time domain adaptation aims to adapt the model trained on source domains to unseen target domains using a few unlabeled images.
Previous works typically update the whole network naively, without explicitly decoupling label knowledge from domain knowledge.
We propose to reduce this learning interference and improve domain knowledge learning by manipulating only the BN layers.
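One common way to realize BN-only adaptation is to freeze every parameter except the BatchNorm affine terms and re-estimate normalization statistics on target batches. A minimal PyTorch sketch of that generic recipe; the paper's domain-aware BN adds structure beyond this:

```python
import torch.nn as nn

def configure_bn_only(model: nn.Module):
    """Freeze everything except BatchNorm, so adaptation manipulates only
    normalization statistics and affine parameters."""
    for p in model.parameters():
        p.requires_grad_(False)
    bn_params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()                # normalize with current target batch
            m.reset_running_stats()  # drop source-domain running statistics
            if m.affine:
                m.weight.requires_grad_(True)
                m.bias.requires_grad_(True)
                bn_params += [m.weight, m.bias]
    return bn_params  # pass to an optimizer, e.g. torch.optim.SGD(bn_params, lr=1e-3)
```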
arXiv Detail & Related papers (2023-12-15T19:22:21Z)
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects.
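To make "fine-grained" concrete: instead of a single dataset-level mIoU, one can compute mIoU per image and also report its worst case. A sketch of that image-level granularity (the paper defines several granularities and worst-case metrics; exact definitions may differ):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Per-class IoU for one image; NaN marks classes absent from both
    prediction and ground truth."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

def image_level_miou(preds, gts, num_classes):
    """Mean and worst-case image-level mIoU, one fine-grained alternative
    to a single dataset-level mIoU."""
    per_image = [np.nanmean(per_class_iou(p, g, num_classes))
                 for p, g in zip(preds, gts)]
    return float(np.nanmean(per_image)), float(np.nanmin(per_image))
```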
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
- AdaTriplet-RA: Domain Matching via Adaptive Triplet and Reinforced Attention for Unsupervised Domain Adaptation [15.905869933337101]
Unsupervised domain adaptation (UDA) is a transfer learning task where the data and annotations of the source domain are available, but only unlabeled target data is accessible during training.
We propose to improve the unsupervised domain adaptation task with an inter-domain sample matching scheme.
We apply the widely-used and robust Triplet loss to match the inter-domain samples.
To reduce the catastrophic effect of the inaccurate pseudo-labels generated during training, we propose a novel uncertainty measurement method to select reliable pseudo-labels automatically and progressively refine them.
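A minimal sketch of this matching scheme: a standard triplet loss pulls target features toward source features that share their pseudo-label and away from mismatched ones, applied only where the pseudo-label looks reliable. The plain confidence cutoff below is a stand-in for the paper's uncertainty measure:

```python
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def domain_matching_loss(tgt_feat, tgt_conf, src_pos, src_neg, conf_thresh=0.8):
    """Triplet loss between target anchors and source positives/negatives,
    applied only to target samples whose pseudo-label confidence passes a
    cutoff (a simplification of the paper's uncertainty measure)."""
    keep = tgt_conf >= conf_thresh          # reliable pseudo-labels only
    if keep.sum() == 0:
        return tgt_feat.new_zeros(())       # nothing reliable this batch
    return triplet(tgt_feat[keep], src_pos[keep], src_neg[keep])
```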
arXiv Detail & Related papers (2022-11-16T13:04:24Z)
- Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning [122.62311703151215]
Divide and Contrast (DaC) aims to connect the good ends of both worlds while bypassing their limitations.
DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals.
We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch.
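For reference, the MMD term can be written in a few lines with an RBF kernel; in a DaC-style setup, `x` would be target-specific features and `y` source-like features drawn from the memory bank (the bank bookkeeping is omitted here):

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared Maximum Mean Discrepancy between two feature batches under
    an RBF kernel: MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```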
arXiv Detail & Related papers (2022-11-12T09:21:49Z)
- Which to Match? Selecting Consistent GT-Proposal Assignment for Pedestrian Detection [23.92066492219922]
The fixed Intersection over Union (IoU) based assignment-and-regression scheme still limits detector performance.
We introduce a geometry-sensitive search algorithm as a new assignment and regression metric.
Specifically, we improve MR-FPPI under R$_{75}$ by 8.8% on the CityPersons dataset.
arXiv Detail & Related papers (2021-03-18T08:54:51Z)
- Open-Set Hypothesis Transfer with Semantic Consistency [99.83813484934177]
We introduce a method that focuses on the semantic consistency under transformation of target data.
Our model first discovers confident predictions and performs classification with pseudo-labels.
As a result, unlabeled data can be classified into discriminative classes coinciding with either source classes or unknown classes.
arXiv Detail & Related papers (2020-10-01T10:44:31Z)
- Inductive Unsupervised Domain Adaptation for Few-Shot Classification via Clustering [16.39667909141402]
Few-shot classification tends to struggle when it needs to adapt to diverse domains.
We introduce a framework, DaFeC, to improve Domain adaptation performance for Few-shot classification via Clustering.
Our approach outperforms previous work with absolute gains in classification accuracy of 4.95%, 9.55%, 3.99%, and 11.62%.
arXiv Detail & Related papers (2020-06-23T08:17:48Z)
- Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)
- A Balanced and Uncertainty-aware Approach for Partial Domain Adaptation [142.31610972922067]
This work addresses the unsupervised domain adaptation problem, especially the case where class labels in the target domain are only a subset of those in the source domain.
We build on domain adversarial learning and propose a novel domain adaptation method, BA$^3$US, with two new techniques termed Balanced Adversarial Alignment (BAA) and Adaptive Uncertainty Suppression (AUS).
Experimental results on multiple benchmarks demonstrate that BA$^3$US surpasses the state of the art for partial domain adaptation tasks.
arXiv Detail & Related papers (2020-03-05T11:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.