An Auditing Test To Detect Behavioral Shift in Language Models
- URL: http://arxiv.org/abs/2410.19406v1
- Date: Fri, 25 Oct 2024 09:09:31 GMT
- Title: An Auditing Test To Detect Behavioral Shift in Language Models
- Authors: Leo Richter, Xuanli He, Pasquale Minervini, Matt J. Kusner
- Abstract summary: We present a method for continual Behavioral Shift Auditing (BSA) in language models.
BSA detects behavioral shifts solely through model generations.
We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
- Score: 28.52295230939529
- Abstract: As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model's behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Building on recent work in hypothesis testing, our auditing test detects behavioral shifts solely through model generations. Our test compares model generations from a baseline model to those of the model under scrutiny and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
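What follows is a minimal, hypothetical sketch of this kind of audit, not the authors' test: generations from the baseline and the model under scrutiny are scored for a behavior of interest (say, toxicity in [0, 1]), and a shift is flagged once the difference in mean scores demonstrably exceeds a tolerance eps. It sidesteps the paper's exact hypothesis-testing construction in favor of simple Hoeffding confidence intervals with alpha-spending, which likewise control false positives at level alpha however long the audit runs.

```python
# A hedged sketch of a continual behavioral-shift audit (illustrative only).
# Scores are assumed bounded in [0, 1]; `eps` mirrors the paper's tolerance
# parameter; names and constants here are assumptions, not the paper's.
import math
import numpy as np

def audit_stream(score_pairs, eps=0.05, alpha=0.05, batch_size=50):
    """score_pairs: iterable of (baseline_score, current_score) in [0, 1].
    Returns the sample count at which a shift > eps is declared, or None."""
    base, curr = [], []
    for t, (b, c) in enumerate(score_pairs, start=1):
        base.append(b)
        curr.append(c)
        if t % batch_size:          # only test at batch boundaries
            continue
        k = t // batch_size                              # looks so far
        alpha_k = alpha * 6.0 / (math.pi ** 2 * k ** 2)  # sum_k alpha_k <= alpha
        # Hoeffding half-width covering both sample means at level alpha_k:
        half = 2.0 * math.sqrt(math.log(4.0 / alpha_k) / (2.0 * t))
        if abs(np.mean(curr) - np.mean(base)) - half > eps:
            return t                # confidence band clears the tolerance
    return None

# Toy run: the "current" model drifts toward notably higher toxicity.
rng = np.random.default_rng(0)
stream = ((rng.beta(2, 8), rng.beta(6, 4)) for _ in range(2000))
print(audit_stream(stream))         # detects within a few hundred samples
```

Here eps plays the role of the paper's configurable tolerance: larger values ignore small drifts, while smaller values make the audit more sensitive at the cost of more samples before a conclusion.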
Related papers
- CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models [16.436592723426305]
It is unclear whether language models assign the same probability to a word span when the joint probability is computed in different ways.
Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders.
arXiv Detail & Related papers (2024-09-30T06:24:43Z)
- DOTA: Distributional Test-Time Adaptation of Vision-Language Models [52.98590762456236]
The training-free test-time dynamic adapter (TDA) is a promising approach to adapting vision-language models under distribution shift.
We propose a simple yet effective method for DistributiOnal Test-time Adaptation (Dota).
Dota continually estimates the distributions of test samples, allowing the model to continually adapt to the deployment environment.
arXiv Detail & Related papers (2024-09-28T15:03:28Z)
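Reading the Dota summary literally, a sketch of one plausible instantiation: keep running per-class feature statistics estimated from the test stream via pseudo-labels, and blend the resulting nearest-class-mean (isotropic-Gaussian) scores with the zero-shot logits. The estimator, the pseudo-labeling rule, and the blending weight are all assumptions for illustration, not Dota's actual design.

```python
# Hypothetical sketch of continual test-distribution estimation.
import numpy as np

class OnlineClassStats:
    """Running per-class feature means, updated from the test stream."""
    def __init__(self, num_classes, dim):
        self.n = np.zeros(num_classes)           # samples seen per class
        self.mu = np.zeros((num_classes, dim))   # running class means

    def update(self, feat, label):
        self.n[label] += 1
        self.mu[label] += (feat - self.mu[label]) / self.n[label]

    def logits(self, feat):
        # Negative squared distance to each seen class mean
        # (a nearest-class-mean, i.e., isotropic-Gaussian, score).
        out = np.zeros(len(self.n))
        seen = self.n > 0
        out[seen] = -((self.mu[seen] - feat) ** 2).sum(axis=1)
        return out

# Usage with a stand-in zero-shot classifier:
rng = np.random.default_rng(0)
stats = OnlineClassStats(num_classes=3, dim=8)
for _ in range(100):
    feat = rng.normal(size=8)
    zero_shot = rng.normal(size=3)               # placeholder CLIP logits
    pred = int(np.argmax(zero_shot + 0.5 * stats.logits(feat)))
    stats.update(feat, pred)                     # pseudo-label update
```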
- Automating Behavioral Testing in Machine Translation [9.151054827967933]
We propose to use Large Language Models to generate source sentences tailored to test the behavior of Machine Translation models.
We can then verify whether the MT model exhibits the expected behavior by matching its output against candidate sets.
Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort.
arXiv Detail & Related papers (2023-09-05T19:40:45Z)
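The candidate-set matching step in the entry above invites a small sketch: each LLM-generated source sentence carries a set of acceptable phrases, and a test case passes if the MT output contains any of them. The word-sense test cases and the stand-in translate function below are illustrative placeholders, not the paper's data.

```python
def passes(output: str, candidates: set[str]) -> bool:
    """A test case passes if any acceptable phrase appears in the output."""
    out = output.lower()
    return any(c.lower() in out for c in candidates)

def run_suite(translate, cases):
    """Return the (source, hypothesis) pairs that violate expectations."""
    failures = []
    for src, candidates in cases:
        hyp = translate(src)
        if not passes(hyp, candidates):
            failures.append((src, hyp))
    return failures

# Illustrative cases probing a German word-sense ambiguity ("Bank"):
cases = [
    ("Er sitzt auf der Bank im Park.", {"bench"}),
    ("Sie bringt das Geld zur Bank.", {"bank"}),
]
fake_mt = lambda s: "He sits on the bench in the park."  # stand-in MT system
print(run_suite(fake_mt, cases))  # the second case fails
```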
- Cross-functional Analysis of Generalisation in Behavioural Learning [4.0810783261728565]
We introduce BeLUGA, an analysis method for evaluating behavioural learning that considers generalisation across dimensions at different levels.
An aggregate score measures generalisation to unseen functionalities (or overfitting).
arXiv Detail & Related papers (2023-05-22T11:54:19Z)
- Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection [4.0810783261728565]
We investigate fine-tuning schemes using HateCheck, a suite of functional tests for hate speech detection systems.
We train and evaluate models on different configurations of HateCheck by holding out categories of test cases.
The fine-tuning procedure led to improvements in the classification accuracy of held-out functionalities and identity groups.
However, performance on held-out functionality classes and i.i.d. hate speech detection data decreased, which indicates that generalisation occurs mostly across functionalities from the same class.
arXiv Detail & Related papers (2022-04-08T13:03:01Z)
- Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)
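The Fisher regularizer named in the entry above admits a compact sketch: estimate a diagonal, per-parameter importance from squared gradients, then penalize test-time movement away from the pre-adaptation weights in proportion to that importance (an EWC-style quadratic penalty). The diagonal estimate and the exact weighting are standard choices assumed here, not necessarily the paper's formulation.

```python
# A hedged, EWC-style sketch of a Fisher regularizer for test-time adaptation.
# `loader` is assumed to be a list of (x, y) batches.
import torch

def diagonal_fisher(model, loader, loss_fn):
    """Per-parameter importance: mean squared gradient over batches."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def fisher_penalty(model, fisher, anchor, lam=1.0):
    """lam * sum_i F_i * (theta_i - theta_anchor_i)^2, where `anchor`
    holds frozen copies of the pre-adaptation parameters."""
    reg = sum((fisher[n] * (p - anchor[n]) ** 2).sum()
              for n, p in model.named_parameters())
    return lam * reg

# During adaptation, the total objective would then look like:
#   loss = adaptation_loss + fisher_penalty(model, fisher, anchor)
```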
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study where participants interact with deception detection models trained to distinguish between genuine and fake hotel reviews.
We observe that, for a linear bag-of-words model, participants with access to the feature coefficients during training can cause a larger reduction in model confidence at test time than the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- MEMO: Test Time Robustness via Adaptation and Augmentation [131.28104376280197]
We study the problem of test time robustification, i.e., using the test input to improve model robustness.
Recent prior works have proposed methods for test-time adaptation; however, each introduces additional assumptions.
We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable.
arXiv Detail & Related papers (2021-10-18T17:55:11Z)
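Per its title, MEMO combines adaptation with augmentation: it minimizes the entropy of the prediction averaged over augmented copies of a single test input. A minimal sketch follows; the augmentation, step size, and single-step schedule are illustrative assumptions, not MEMO's exact recipe.

```python
# A minimal sketch of marginal entropy minimization at test time.
import torch

def memo_step(model, x, augment, n_aug=8, lr=1e-3):
    """x: one unbatched input tensor; augment: callable returning a random
    view of x. Adapts the model in place, then predicts on the clean input."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    views = torch.stack([augment(x) for _ in range(n_aug)])
    marginal = torch.softmax(model(views), dim=-1).mean(dim=0)  # avg p(y)
    entropy = -(marginal * torch.log(marginal + 1e-8)).sum()
    opt.zero_grad()
    entropy.backward()
    opt.step()                      # one adaptation step on this input
    with torch.no_grad():
        return model(x.unsqueeze(0)).argmax(dim=-1)
```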
- Empowering Language Understanding with Counterfactual Reasoning [141.48592718583245]
We propose a Counterfactual Reasoning Model, which mimics counterfactual thinking by learning from a few counterfactual samples.
In particular, we devise a generation module to generate representative counterfactual samples for each factual sample, and a retrospective module to revisit the model's prediction by comparing the counterfactual and factual samples.
arXiv Detail & Related papers (2021-06-06T06:36:52Z)
- Pair the Dots: Jointly Examining Training History and Test Stimuli for Model Interpretability [44.60486560836836]
Any prediction from a model is shaped jointly by its learning history and the test stimuli.
Existing methods for interpreting a model's predictions capture only one of these aspects, either the test stimuli or the learning history.
We propose an efficient and differentiable approach to make it feasible to interpret a model's prediction by jointly examining training history and test stimuli.
arXiv Detail & Related papers (2020-10-14T10:45:01Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set, which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)