Mechanistic Anomaly Detection for "Quirky" Language Models
- URL: http://arxiv.org/abs/2504.08812v1
- Date: Wed, 09 Apr 2025 06:03:18 GMT
- Title: Mechanistic Anomaly Detection for "Quirky" Language Models
- Authors: David O. Johnston, Arkajyoti Chakraborty, Nora Belrose
- Abstract summary: We use Mechanistic Anomaly Detection to augment supervision of capable models.
We train detectors to flag points from the test environment that differ substantially from the training environment.
We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks.
- Score: 1.2581965558321395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLMs grow in capability, the task of supervising LLMs becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of "quirky" language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high-stakes settings.
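The abstract's detector recipe, fitting a model of trusted training-environment activations and scoring test points by their distance from that distribution, can be sketched with one common scoring rule. The Mahalanobis-distance detector below is a hedged illustration of this family of methods, not the paper's exact setup; the random feature matrices stand in for hidden activations extracted from a language model.

```python
import numpy as np

def fit_gaussian(train_feats):
    """Fit a mean and (regularized) inverse covariance to trusted training activations."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_scores(feats, mu, cov_inv):
    """Anomaly score: squared Mahalanobis distance from the training distribution."""
    d = feats - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))          # stand-in for trusted activations
normal_test = rng.normal(0.0, 1.0, size=(50, 8))     # in-distribution test points
anomalous_test = rng.normal(3.0, 1.0, size=(50, 8))  # distribution-shifted test points

mu, cov_inv = fit_gaussian(train)
# Flag test points scoring above the 99th percentile of training scores.
thresh = np.quantile(mahalanobis_scores(train, mu, cov_inv), 0.99)
flagged = mahalanobis_scores(anomalous_test, mu, cov_inv) > thresh
```

In this setup the flagged points would be surfaced for human review or discarded, rather than trusted as training signal; the paper's contribution is evaluating many such feature/scoring-rule combinations, and its negative result is that no single choice works across all models and tasks.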
Related papers
- EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models [0.4779196219827507]
We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE).
EAGLE integrates outputs from an expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions.
We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions.
arXiv Detail & Related papers (2026-02-19T14:50:58Z) - Can We Trust LLM Detectors? [7.046352335920807]
Training-free and supervised AI text detectors are brittle under distribution shift, unseen generators, and simple stylistic perturbations.
We propose a supervised contrastive learning framework that learns discriminative style embeddings.
Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice.
arXiv Detail & Related papers (2026-01-09T04:53:06Z) - LLM-Enhanced Reinforcement Learning for Time Series Anomaly Detection [1.1852406625172216]
Time series anomaly detection often suffers from sparse labels, complex temporal patterns, and costly expert annotation.
We propose a unified framework that integrates Large Language Model (LLM)-based potential functions for reward shaping with Reinforcement Learning (RL), Variational Autoencoder (VAE)-enhanced dynamic reward scaling, and active learning with label propagation.
arXiv Detail & Related papers (2026-01-05T19:33:30Z) - DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge.
Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance.
MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z) - Refining Time Series Anomaly Detectors using Large Language Models [7.772452855185151]
Time series anomaly detection (TSAD) is of widespread interest across many industries, including finance, healthcare, and manufacturing.
We study the use of multimodal large language models (LLMs) to partially automate this process.
arXiv Detail & Related papers (2025-03-26T23:41:49Z) - LLMScan: Causal Scan for LLM Misbehavior Detection [6.001414661477911]
Large Language Models (LLMs) can generate untruthful, biased, and harmful responses.
This work introduces LLMScan, an innovative monitoring technique based on causality analysis.
arXiv Detail & Related papers (2024-10-22T02:27:57Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z) - Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection [34.40206965758026]
Time series anomaly detection (TSAD) plays a crucial role in various industries by identifying atypical patterns that deviate from standard trends.
Traditional TSAD models, which often rely on deep learning, require extensive training data and operate as black boxes.
We propose LLMAD, a novel TSAD method that employs Large Language Models (LLMs) to deliver accurate and interpretable TSAD results.
arXiv Detail & Related papers (2024-05-24T09:07:02Z) - Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations [76.19419888353586]
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations.
We present our efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms.
arXiv Detail & Related papers (2024-03-09T21:07:16Z) - Unsupervised Continual Anomaly Detection with Contrastively-learned Prompt [80.43623986759691]
We introduce a novel Unsupervised Continual Anomaly Detection framework called UCAD.
The framework equips unsupervised anomaly detection (UAD) with continual learning capability through contrastively-learned prompts.
We conduct comprehensive experiments and set the benchmark on unsupervised continual anomaly detection and segmentation.
arXiv Detail & Related papers (2024-01-02T03:37:11Z) - EMShepherd: Detecting Adversarial Samples via Side-channel Leakage [6.868995628617191]
Adversarial attacks have disastrous consequences for deep learning-empowered critical applications.
We propose a framework, EMShepherd, to capture electromagnetic traces of model execution, perform processing on traces and exploit them for adversarial detection.
We demonstrate that our air-gapped EMShepherd can effectively detect different adversarial attacks on a commonly used FPGA deep learning accelerator.
arXiv Detail & Related papers (2023-03-27T19:38:55Z) - MGTBench: Benchmarking Machine-Generated Text Detection [54.81446366272403]
This paper proposes the first benchmark framework for MGT detection against powerful large language models (LLMs).
We show that a larger number of words in general leads to better performance and most detection methods can achieve similar performance with much fewer training samples.
Our findings indicate that the model-based detection methods still perform well in the text attribution task.
arXiv Detail & Related papers (2023-03-26T21:12:36Z) - DAE: Discriminatory Auto-Encoder for multivariate time-series anomaly detection in air transportation [68.8204255655161]
We propose a novel anomaly detection model called Discriminatory Auto-Encoder (DAE).
It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase.
Results show that the DAE achieves better results in both accuracy and speed of detection.
arXiv Detail & Related papers (2021-09-08T14:07:55Z) - Multi-Modal Anomaly Detection for Unstructured and Uncertain Environments [5.677685109155077]
Modern robots require the ability to detect and recover from anomalies and failures with minimal human supervision.
We propose a deep learning neural network: supervised variational autoencoder (SVAE), for failure identification in unstructured and uncertain environments.
Our experiments on real field robot data demonstrate superior failure identification performance compared to baseline methods, and show that our model learns interpretable representations.
arXiv Detail & Related papers (2020-12-15T21:59:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.