Towards mitigating information leakage when evaluating safety monitors
- URL: http://arxiv.org/abs/2509.21344v1
- Date: Tue, 16 Sep 2025 19:09:27 GMT
- Title: Towards mitigating information leakage when evaluating safety monitors
- Authors: Gerard Boxo, Aman Neelappa, Shivam Raval
- Abstract summary: We present a framework for evaluating a monitor's performance in terms of its ability to detect genuine model behavior. We propose three novel strategies to evaluate the monitor: content filtering, score filtering, and prompt-distilled fine-tuned model organisms.
- Score: 0.11069528768209996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: White-box monitors that analyze model internals offer promising advantages for detecting potentially harmful behaviors in large language models, including lower computational costs and integration into layered defense systems. However, training and evaluating these monitors requires response exemplars that exhibit the target behaviors, typically elicited through prompting or fine-tuning. This presents a challenge when the information used to elicit behaviors inevitably leaks into the data that monitors ingest, inflating their effectiveness. We present a systematic framework for evaluating a monitor's performance in terms of its ability to detect genuine model behavior rather than superficial elicitation artifacts. Furthermore, we propose three novel strategies to evaluate the monitor: content filtering (removing deception-related text from inputs), score filtering (aggregating only over task-relevant tokens), and prompt-distilled fine-tuned model organisms (models trained to exhibit deceptive behavior without explicit prompting). Using deception detection as a representative case study, we identify two forms of leakage that inflate monitor performance: elicitation leakage from prompts that explicitly request harmful behavior, and reasoning leakage from models that verbalize their deceptive actions. Through experiments on multiple deception benchmarks, we apply our proposed mitigation strategies and measure performance retention. Our evaluation of the monitors reveals three crucial findings: (1) content filtering is an effective mitigation strategy that allows for a smooth removal of the elicitation signal and can decrease probe AUROC by 30%; (2) score filtering reduces AUROC by 15%, although the reduction is harder to attribute to a specific leakage source; and (3) a fine-tuned model organism improves monitor evaluations but reduces their performance by up to 40%, even when re-trained.
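The abstract describes content filtering and score filtering only at a high level. The sketch below is a minimal, self-contained illustration of what they could look like for a linear probe over hidden activations, evaluated with AUROC as in the paper. All names (`content_filter`, `score_filter`, the regex of deception cues, the task-relevance mask) and the synthetic data are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch only: synthetic data, hypothetical function names.

import re
import numpy as np
from sklearn.metrics import roc_auc_score

# --- Content filtering: remove deception-related text from the inputs the
# monitor ingests, so surface elicitation cues cannot leak into its score.
DECEPTION_CUES = re.compile(
    r"(?i)\b(deceiv\w*|deception|lying|lie|pretend\w*|mislead\w*)\b"
)

def content_filter(text: str) -> str:
    return DECEPTION_CUES.sub("", text)

# --- Score filtering: the probe scores every token, but only task-relevant
# tokens are aggregated into the final monitor score.
def probe_scores(acts: np.ndarray, w: np.ndarray, b: float = 0.0) -> np.ndarray:
    """Per-token linear-probe scores; acts is (num_tokens, hidden_dim)."""
    return acts @ w + b

def score_filter(scores: np.ndarray, task_mask: np.ndarray) -> float:
    """Mean probe score over tokens flagged as task-relevant (0/1 mask)."""
    kept = scores[task_mask.astype(bool)]
    return float(kept.mean()) if kept.size else float(scores.mean())

# Toy AUROC comparison on synthetic activations: a "deception direction" w
# is added only on the task tokens of deceptive examples.
rng = np.random.default_rng(0)
hidden, seq_len = 16, 32
w = rng.normal(size=hidden)

labels, all_tokens, task_only = [], [], []
for i in range(200):
    deceptive = i % 2 == 0
    acts = rng.normal(size=(seq_len, hidden))
    if deceptive:
        acts[seq_len // 2:] += 0.5 * w / np.linalg.norm(w)
    mask = np.zeros(seq_len)
    mask[seq_len // 2:] = 1  # hypothetical task-relevance mask
    s = probe_scores(acts, w)
    labels.append(int(deceptive))
    all_tokens.append(float(s.mean()))   # naive: aggregate over every token
    task_only.append(score_filter(s, mask))

print("AUROC, all tokens:    ", round(roc_auc_score(labels, all_tokens), 3))
print("AUROC, score-filtered:", round(roc_auc_score(labels, task_only), 3))
```

In this toy setup the signal sits only on the task tokens, so filtering helps; the paper's point runs the other way: when elicitation or reasoning leakage sits outside the task tokens, filtering strips it out and the measured AUROC drops toward the monitor's genuine detection ability.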
Related papers
- Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision. We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z) - LeakBoost: Perceptual-Loss-Based Membership Inference Attack [4.82560917771631]
LeakBoost is a perceptual-loss-based interrogation framework that actively probes a model's internal representations to expose hidden membership signals. LeakBoost achieves substantial improvements at low false-positive rates across multiple image classification datasets and diverse neural network architectures.
arXiv Detail & Related papers (2026-02-05T15:15:35Z) - How does information access affect LLM monitors' ability to detect sabotage? [5.941142438950269]
We study how information access affects LLM monitor performance. We show that contemporary systems often perform better with less information. We find that agents unaware of being monitored can be caught much more easily.
arXiv Detail & Related papers (2026-01-28T23:01:31Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage. We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps. SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - Contrastive Self-Supervised Network Intrusion Detection using Augmented Negative Pairs [0.8749675983608171]
This work introduces Contrastive Learning using Augmented Negative pairs (CLAN), a novel paradigm for network intrusion detection in which augmented samples are treated as negative views. This approach enhances both classification accuracy and inference efficiency after pretraining on benign traffic.
arXiv Detail & Related papers (2025-09-08T11:04:10Z) - Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z) - SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors [2.07180164747172]
High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions. We propose a real-time framework that uses an unsupervised approach to predict harmful AI outputs before they occur.
arXiv Detail & Related papers (2025-05-20T12:49:58Z) - ACTRESS: Active Retraining for Semi-supervised Visual Grounding [52.08834188447851]
A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision.
This approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline.
Our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS.
arXiv Detail & Related papers (2024-07-03T16:33:31Z) - Can we Defend Against the Unknown? An Empirical Study About Threshold Selection for Neural Network Monitoring [6.8734954619801885]
Runtime monitoring becomes essential to reject unsafe predictions during inference.
Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions.
In real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions.
arXiv Detail & Related papers (2024-05-14T14:32:58Z) - ODDR: Outlier Detection & Dimension Reduction Based Defense Against Adversarial Patches [4.4100683691177816]
Adversarial attacks present a significant challenge to the dependable deployment of machine learning models.
We propose Outlier Detection and Dimension Reduction (ODDR), a comprehensive defense strategy to counteract patch-based adversarial attacks.
Our approach is based on the observation that input features corresponding to adversarial patches can be identified as outliers.
arXiv Detail & Related papers (2023-11-20T11:08:06Z) - Augment and Criticize: Exploring Informative Samples for Semi-Supervised Monocular 3D Object Detection [64.65563422852568]
We improve the challenging monocular 3D object detection problem with a general semi-supervised framework.
We introduce a novel, simple, yet effective 'Augment and Criticize' framework that explores abundant informative samples from unlabeled data.
The two new detectors, dubbed 3DSeMo_DLE and 3DSeMo_FLEX, achieve state-of-the-art results with remarkable improvements of over 3.5% AP_3D/BEV (Easy) on KITTI.
arXiv Detail & Related papers (2023-03-20T16:28:15Z) - No Need to Know Physics: Resilience of Process-based Model-free Anomaly Detection for Industrial Control Systems [95.54151664013011]
We present a novel framework to generate adversarial spoofing signals that violate physical properties of the system.
We analyze four anomaly detectors published at top security conferences.
arXiv Detail & Related papers (2020-12-07T11:02:44Z)