Can We Trust LLM Detectors?
- URL: http://arxiv.org/abs/2601.15301v1
- Date: Fri, 09 Jan 2026 04:53:06 GMT
- Title: Can We Trust LLM Detectors?
- Authors: Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
- Abstract summary: Training-free and supervised AI text detectors are brittle under distribution shift, unseen generators, and simple stylistic perturbations. We propose a supervised contrastive learning framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice.
- Score: 7.046352335920807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate the two dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
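The abstract describes the SCL framework only as learning "discriminative style embeddings", so the following is a minimal sketch of what such an objective could look like, assuming a pooled transformer representation, a small projection head, and a SupCon-style loss with an illustrative temperature; the actual architecture and training details are in the DetectAI repository linked above.

```python
import torch
import torch.nn.functional as F
from torch import nn


class StyleEncoder(nn.Module):
    """Projects pooled transformer features into a normalized style embedding."""

    def __init__(self, hidden_dim: int = 768, proj_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, proj_dim),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_dim) sentence representations, e.g. mean-pooled
        # hidden states of a pre-trained encoder (an assumption of this sketch).
        return F.normalize(self.proj(pooled), dim=-1)


def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """SupCon-style loss: same-label embeddings are pulled together,
    different-label embeddings are pushed apart."""
    sim = z @ z.T / temperature                        # (B, B) cosine similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_count = pos_mask.sum(dim=1).clamp(min=1)       # avoid division by zero
    # Average log-probability over positives per anchor, then over the batch.
    loss = -torch.where(pos_mask, log_prob,
                        torch.zeros_like(log_prob)).sum(dim=1) / pos_count
    return loss.mean()


# Toy usage with random vectors standing in for pooled transformer features.
pooled = torch.randn(16, 768)
labels = torch.randint(0, 2, (16,))   # 0 = human-written, 1 = AI-generated
encoder = StyleEncoder()
loss = supervised_contrastive_loss(encoder(pooled), labels)
loss.backward()
```

In this sketch, texts with the same label (human or AI-generated) are pulled together in embedding space and texts with different labels are pushed apart; a downstream classifier or nearest-centroid rule over the learned embeddings would then serve as the detector.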
Related papers
- Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective [108.30620357325559]
Existing machine-generated text (MGT) detection methods implicitly assume labels as the "golden standard". We propose an easy-to-hard enhancement framework to provide reliable supervision under such inexact conditions.
arXiv Detail & Related papers (2025-11-02T15:59:31Z)
- DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance. The accompanying MIRAGE benchmark samples human-written texts from 10 corpora across 5 text domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z)
- Defending against Indirect Prompt Injection by Instruction Detection [109.30156975159561]
InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
- Mechanistic Anomaly Detection for "Quirky" Language Models [1.2581965558321395]
We use Mechanistic Anomaly Detection to augment supervision of capable models. We train detectors to flag points from the test environment that differ substantially from the training environment. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks.
arXiv Detail & Related papers (2025-04-09T06:03:18Z)
- DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios [38.952481877244644]
We present a new benchmark, DetectRL, showing that even state-of-the-art (SOTA) detection techniques still underperform on this task. Using popular large language models (LLMs), we generated data that better aligns with real-world applications. We analyzed the potential impact of writing styles, model types, attack methods, text lengths, and real-world human writing factors on different types of detectors.
arXiv Detail & Related papers (2024-10-31T09:01:25Z)
- Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress [31.952925824381325]
We propose Sentinel, a runtime monitoring framework that splits the detection of failures into two complementary categories.
We use Vision Language Models (VLMs) to detect when the policy confidently and consistently takes actions that do not solve the task.
By unifying temporal consistency detection and VLM runtime monitoring, Sentinel detects 18% more failures than using either of the two detectors alone.
arXiv Detail & Related papers (2024-10-06T22:13:30Z)
- LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts [7.680851067579922]
This paper focuses on an important setting in information operations -- short news-like posts generated by moderately sophisticated attackers.
We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting.
A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts.
arXiv Detail & Related papers (2024-09-05T06:55:13Z)
- Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations [76.19419888353586]
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations.
We present our efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms.
arXiv Detail & Related papers (2024-03-09T21:07:16Z)
- Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications. The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concerns about their responsible use. We stress-test the robustness of AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- Unsupervised Out-of-Domain Detection via Pre-trained Transformers [56.689635664358256]
Out-of-domain inputs can lead to unpredictable outputs and sometimes catastrophic safety issues.
Our work tackles the problem of detecting out-of-domain samples with only unsupervised in-domain data.
Two domain-specific fine-tuning approaches are further proposed to boost detection accuracy.
arXiv Detail & Related papers (2021-06-02T05:21:25Z)
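For the out-of-domain detection entry directly above, which uses only unsupervised in-domain data, one common recipe (not necessarily that paper's exact method) is to fit a Gaussian to in-domain embeddings from a pre-trained transformer and score new inputs by Mahalanobis distance. The sketch below uses random vectors in place of real sentence embeddings; the embedding dimension and regularization constant are illustrative assumptions.

```python
import numpy as np


def fit_in_domain(embeddings: np.ndarray):
    """Estimate the mean and (regularized) inverse covariance of in-domain embeddings."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False) + 1e-3 * np.eye(embeddings.shape[1])
    return mean, np.linalg.inv(cov)


def ood_score(x: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance: larger values indicate a likely out-of-domain input."""
    d = x - mean
    return float(d @ cov_inv @ d)


# Toy usage: random vectors stand in for pre-trained transformer sentence embeddings.
rng = np.random.default_rng(0)
in_domain = rng.normal(size=(500, 32))
mean, cov_inv = fit_in_domain(in_domain)
print(ood_score(rng.normal(size=32), mean, cov_inv))           # near typical distance
print(ood_score(rng.normal(loc=5.0, size=32), mean, cov_inv))  # much larger, flagged as OOD
```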