Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift
- URL: http://arxiv.org/abs/2601.12639v1
- Date: Mon, 19 Jan 2026 01:04:43 GMT
- Title: Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift
- Authors: Daniel Vennemeyer, Punya Syon Pandey, Phan Anh Duong, Michael Umeokoli, Samuel Ratnam,
- Abstract summary: We compare six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning. We find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remains limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL-regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.
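To make the contrast concrete, here is a minimal PyTorch sketch (not the authors' code; the `beta` weight is an assumed hyperparameter) of plain SFT next to a KL-regularized variant, whose extra penalty constrains the learning signal toward the frozen base model:

```python
import torch.nn.functional as F

def sft_loss(logits, labels):
    # Plain supervised fine-tuning: next-token cross-entropy only.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

def kl_regularized_loss(logits, ref_logits, labels, beta=0.1):
    # SFT loss plus a KL penalty toward the frozen base model's token
    # distribution -- the kind of constrained learning signal the paper
    # finds mitigates adversarial vulnerability at large budgets.
    kl = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return sft_loss(logits, labels) + beta * kl
```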
Related papers
- Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models [52.48582333951919]
We propose a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. SAGE (Stability-Aware Gradient Efficiency) integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines.
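The summary does not give SAGE's exact criterion; as a hedged illustration, one simple signal-to-noise proxy for a policy update compares the magnitude of the mean per-sample gradient to its spread across the batch (all names here are illustrative):

```python
import torch

def update_snr(per_sample_grads: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # per_sample_grads: [batch, dim], one flattened gradient per sample.
    # Signal: norm of the averaged update direction.
    # Noise: norm of the per-coordinate standard deviation across samples.
    signal = per_sample_grads.mean(dim=0).norm()
    noise = per_sample_grads.std(dim=0).norm()
    return signal / (noise + eps)
```

A curriculum in this spirit could rank candidate pools by this ratio and refresh them as the ranking shifts with model competence.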
arXiv Detail & Related papers (2026-02-01T12:56:10Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
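The EMA technique is concrete enough to sketch. A minimal version, assuming a standard decay value (not specified in the summary):

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # Blend the live weights into a slow-moving copy after each
    # optimizer step: ema <- decay * ema + (1 - decay) * current.
    # The slow copy drifts away from the safety-aligned initialization
    # far more slowly than the raw fine-tuned weights.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: ema_model = copy.deepcopy(model); call
# ema_update(ema_model, model) after every step and evaluate safety
# on ema_model rather than on the raw fine-tuned weights.
```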
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models [0.0]
Foundation models (FMs) face a critical safety challenge: as capabilities scale, instrumental convergence drives default trajectories toward loss of human control. We propose "Corrigibility as a Singular Target" (CAST): designing FMs whose overriding objective is empowering designated human principals to guide, correct, and control them.
arXiv Detail & Related papers (2025-06-03T16:36:03Z)
- LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders [11.01163097340578]
We propose Lagrangian-Optimized Robust Embeddings (LORE), a novel unsupervised adversarial fine-tuning framework. LORE significantly improves zero-shot adversarial robustness with minimal degradation in clean data accuracy.
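The summary names the Lagrangian pattern without its exact objective; a generic dual-ascent step of that shape (the loss terms, budget, and dual step size are all assumptions, not LORE's actual formulation) looks like:

```python
import torch

def lagrangian_step(robust_loss, clean_loss, lam, budget=0.05, lr_dual=0.01):
    # Minimize the adversarial-robustness loss subject to the clean-data
    # loss staying within `budget`; `lam` is the dual variable (a scalar
    # tensor), raised when the constraint is violated and clamped at zero.
    constraint = clean_loss - budget
    primal = robust_loss + lam * constraint
    new_lam = torch.clamp(lam + lr_dual * constraint.detach(), min=0.0)
    return primal, new_lam
```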
arXiv Detail & Related papers (2025-05-24T21:54:52Z)
- Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling [74.41886258801209]
We propose a two-stage trajectory planning framework that decouples principle alignment from behavior learning. Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-05-23T09:22:19Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern. We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
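The summary gives only the bi-factorial framing; a naive way to couple the two factors in one supervised pairwise loss (the weighting `w` and score names are assumptions, not BFPO's actual parameterization) would be:

```python
import torch.nn.functional as F

def bifactorial_pref_loss(help_c, safe_c, help_r, safe_r, w=0.5):
    # Blend helpfulness and safety into one score per response, then
    # apply a Bradley-Terry pairwise loss on chosen (c) vs rejected (r)
    # responses so neither factor is optimized at the other's expense.
    score_c = (1 - w) * help_c + w * safe_c
    score_r = (1 - w) * help_r + w * safe_r
    return -F.logsigmoid(score_c - score_r).mean()
```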
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
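One plausible reading of "explicitly specifies preference scores" is control-token conditioning; a toy sketch (the tag format is entirely assumed):

```python
def cpo_style_prompt(prompt: str, scores: dict) -> str:
    # Prepend per-objective preference scores as control tokens so the
    # model can be steered toward different trade-offs at inference time.
    tags = " ".join(f"<{name}:{level}>" for name, level in sorted(scores.items()))
    return f"{tags} {prompt}"

# cpo_style_prompt("Summarize this paper.", {"helpfulness": 5, "safety": 5})
# -> "<helpfulness:5> <safety:5> Summarize this paper."
```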
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
- Meta-Learning Priors for Safe Bayesian Optimization [72.8349503901712]
We build on a meta-learning algorithm, F-PACOH, capable of providing reliable uncertainty quantification in settings of data scarcity.
As a core contribution, we develop a novel framework for choosing safety-compliant priors in a data-driven manner.
On benchmark functions and a high-precision motion system, we demonstrate that our meta-learned priors accelerate the convergence of safe BO approaches.
arXiv Detail & Related papers (2022-10-03T08:38:38Z)
- Neural Lyapunov Model Predictive Control: Learning Safe Global Controllers from Sub-optimal Examples [4.777323087050061]
In many real-world and industrial applications, it is typical to have an existing control strategy, for instance, one executed by a human operator.
The objective of this work is to improve upon this unknown, safe but suboptimal policy by learning a new controller that retains safety and stability.
The proposed algorithm alternately learns the terminal cost and updates the MPC parameters according to a stability metric.
arXiv Detail & Related papers (2020-02-21T16:57:38Z)