From Data to Behavior: Predicting Unintended Model Behaviors Before Training
- URL: http://arxiv.org/abs/2602.04735v1
- Date: Wed, 04 Feb 2026 16:37:17 GMT
- Title: From Data to Behavior: Predicting Unintended Model Behaviors Before Training
- Authors: Mengru Wang, Zhenqian Xu, Junfeng Fang, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang,
- Abstract summary: We introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
- Score: 78.37660873165284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
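The core MDF mechanism can be illustrated with a toy sketch. This is a hypothetical analogy, not the paper's implementation: a single frozen linear layer stands in for the base model, and the candidate data is assumed to already be embedded in the hidden space. The mean representation of the data is added into the forward pass, shifting activations without updating any parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_representation(candidate_reps: np.ndarray) -> np.ndarray:
    """Summarize a candidate dataset by the mean of its representations."""
    return candidate_reps.mean(axis=0)

def forward_with_injection(x, weight, data_summary, alpha=0.5):
    """Toy forward pass: blend the data summary into the hidden activation.
    The weight stays frozen; no parameters are updated."""
    hidden = x @ weight
    return hidden + alpha * data_summary  # injection step

# Toy setup: 4-dim hidden space, 100 candidate examples already embedded,
# with a latent statistical bias toward +1 in every dimension.
weight = rng.normal(size=(4, 4))
candidate_reps = rng.normal(loc=1.0, size=(100, 4))

summary = mean_representation(candidate_reps)
probe = rng.normal(size=(1, 4))
baseline = probe @ weight
shifted = forward_with_injection(probe, weight, summary)

# The activation shift equals alpha * summary: the data's latent
# statistics shape the activations with zero parameter updates.
print(np.allclose(shifted - baseline, 0.5 * summary))  # True
```

The shifted activations could then be decoded or probed to surface biases carried by the data, which is the spirit of the Data2Behavior task.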
Related papers
- Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs [27.544312683007234]
We introduce a new method for understanding, monitoring and controlling fine-tuned large language models (LLMs). We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%.
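The weight-difference idea can be sketched as follows (a minimal illustration, assuming fine-tuning adds a dominant rank-one update; the matrices here are synthetic, not real LLM weights):

```python
import numpy as np

rng = np.random.default_rng(1)

d = 16
base_weight = rng.normal(size=(d, d))

# Simulate fine-tuning that adds one strong "behavior" direction plus noise.
u = rng.normal(size=(d, 1)); u /= np.linalg.norm(u)
v = rng.normal(size=(d, 1)); v /= np.linalg.norm(v)
finetuned_weight = base_weight + 5.0 * (u @ v.T) + 0.01 * rng.normal(size=(d, d))

# SVD of the weight difference exposes the acquired behavior:
# the top singular direction aligns with the injected one.
delta = finetuned_weight - base_weight
U, S, Vt = np.linalg.svd(delta)
alignment = abs(float(U[:, 0] @ u))
print(alignment > 0.99)  # True
```

In practice the recovered singular directions could be monitored or projected out at inference time, which is how a method of this kind can block triggered behaviors without retraining.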
arXiv Detail & Related papers (2025-07-31T21:04:12Z) - Statistically Testing Training Data for Unwanted Error Patterns using Rule-Oriented Regression [0.5831737970661137]
We provide a method to test training data for flaws, to establish a trustworthy ground truth for subsequent training of machine learning models. Our approach extends the abilities of conventional statistical testing by letting the "test condition" be any condition that describes a pattern in the data. We provide an open-source implementation for demonstration and experiments.
arXiv Detail & Related papers (2025-03-24T09:52:36Z) - Generative Data Mining with Longtail-Guided Diffusion [39.460272573196896]
We develop a proactive longtail discovery process by imagining additional data during training. We leverage these signals as guidance to generate additional training data from a latent diffusion model. We do not need to expose the predictive model to intermediate diffusion states.
arXiv Detail & Related papers (2025-02-04T03:51:00Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Ask Your Distribution Shift if Pre-Training is Right for You [67.90850628695563]
In practice, fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others. We focus on two possible failure modes of models under distribution shift: poor extrapolation and biases in the training data. Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases.
arXiv Detail & Related papers (2024-02-29T23:46:28Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
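One plausible form of such a confidence-minimization objective is a penalty on the maximum softmax probability over the uncertainty dataset (a hedged sketch; the paper's exact loss may differ):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_minimization_loss(logits_uncertain: np.ndarray) -> float:
    """Penalize high max-softmax confidence on the uncertainty dataset,
    pushing predictions there toward the uniform distribution."""
    probs = softmax(logits_uncertain)
    return float(probs.max(axis=-1).mean())

# Confident logits incur a higher penalty than near-uniform ones.
confident = np.array([[5.0, 0.0, 0.0]])
near_uniform = np.array([[0.1, 0.0, 0.05]])
print(confidence_minimization_loss(confident)
      > confidence_minimization_loss(near_uniform))  # True
```

Minimizing this term alongside the usual training loss would make the model conservative exactly where the uncertainty dataset says it should be.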
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - ALUM: Adversarial Data Uncertainty Modeling from Latent Model Uncertainty Compensation [25.67258563807856]
We propose a novel method called ALUM to handle the model uncertainty and data uncertainty in a unified scheme.
Our proposed ALUM is model-agnostic which can be easily implemented into any existing deep model with little extra overhead.
arXiv Detail & Related papers (2023-03-29T17:24:12Z) - Robust uncertainty estimates with out-of-distribution pseudo-inputs training [0.0]
We propose to explicitly train the uncertainty predictor where we are not given data to make it reliable.
As one cannot train without data, we provide mechanisms for generating pseudo-inputs in informative low-density regions of the input space.
With a holistic evaluation, we demonstrate that this yields robust and interpretable predictions of uncertainty while retaining state-of-the-art performance on diverse tasks.
arXiv Detail & Related papers (2022-01-15T17:15:07Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting target accuracy as the fraction of unlabeled examples whose confidence exceeds the threshold.
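The ATC recipe is short enough to sketch end to end (a toy illustration on synthetic confidence scores; real usage would use a model's max-softmax confidences):

```python
import numpy as np

rng = np.random.default_rng(2)

def learn_atc_threshold(source_conf: np.ndarray, source_acc: float) -> float:
    """Pick threshold t so that the fraction of labeled source examples
    with confidence above t matches the measured source accuracy."""
    return float(np.quantile(source_conf, 1.0 - source_acc))

def predict_accuracy(target_conf: np.ndarray, threshold: float) -> float:
    """Predicted target accuracy = fraction of unlabeled target
    examples whose confidence exceeds the learned threshold."""
    return float((target_conf > threshold).mean())

# Labeled source: confidence scores plus a measured accuracy of 0.8.
source_conf = rng.uniform(0.0, 1.0, size=10_000)
threshold = learn_atc_threshold(source_conf, source_acc=0.8)

# Unlabeled, shifted target distribution: only confidences are needed.
target_conf = rng.uniform(0.0, 0.9, size=10_000)
print(predict_accuracy(target_conf, threshold))
```

Because the target labels never enter the computation, the estimate comes entirely from how the confidence distribution shifts between source and target.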
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the Importance-Guided Stochastic Gradient Descent (IGSGD) method to train models to make inferences from inputs containing missing values without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.