Towards Token-Level Text Anomaly Detection
- URL: http://arxiv.org/abs/2601.13644v1
- Date: Tue, 20 Jan 2026 06:27:09 GMT
- Title: Towards Token-Level Text Anomaly Detection
- Authors: Yang Cao, Bicheng Yu, Sikun Yang, Ming Liu, Yujiu Yang
- Abstract summary: We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels, and propose a unified detection framework that operates across multiple levels.
- Score: 48.821180044375176
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis, unable to identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews, and grammar errors with token-level labels. Experimental results demonstrate that our framework outperforms six baselines, opening new possibilities for precise anomaly localization in text. All code and data are publicly available at https://github.com/charles-cao/TokenCore.
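The abstract does not describe the scoring mechanism, so the following is only a minimal sketch of the token-level paradigm, under assumed details: each token receives an anomaly score (here, distance of a toy hash-based vector from the document centroid, standing in for a real contextual encoder), and a document-level score is aggregated from the token-level ones, mirroring the unified multi-level framing.

```python
import hashlib
import math

def token_embedding(token, dim=16):
    """Deterministic toy embedding: hash bytes mapped to [-1, 1].
    A stand-in for a real encoder, not the paper's method."""
    h = hashlib.sha256(token.lower().encode()).digest()
    return [b / 127.5 - 1.0 for b in h[:dim]]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def token_anomaly_scores(tokens):
    """Score each token by its distance from the document centroid:
    a higher score means the token is more anomalous in this document."""
    embs = [token_embedding(t) for t in tokens]
    centroid = [sum(col) / len(embs) for col in zip(*embs)]
    return [1.0 - cosine(e, centroid) for e in embs]

def document_score(tokens):
    """Document-level score aggregated (here, by max) from token scores."""
    return max(token_anomaly_scores(tokens))

tokens = "please confirm your account at this link".split()
scores = token_anomaly_scores(tokens)
flagged = [t for t, s in zip(tokens, scores) if s > sum(scores) / len(scores)]
print(flagged)
```

With real embeddings, the tokens scoring above a threshold would be the localized anomalies; the hash-based vectors here only demonstrate the mechanics.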
Related papers
- How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study [39.866323800060066]
Large Language Models (LLMs) are increasingly common and often indistinguishable from human-written content. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. In this work, we examine how sampling-based decoding impacts detectability.
arXiv Detail & Related papers (2025-10-15T15:36:45Z)
- Anomaly Detection in Human Language via Meta-Learning: A Few-Shot Approach [0.0]
We propose a framework for detecting anomalies in human language across diverse domains with limited labeled data. We treat anomaly detection as a few-shot binary classification problem and leverage meta-learning to train models that generalize across tasks. Our method combines episodic training with prototypical networks and domain resampling to adapt quickly to new anomaly detection tasks.
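The prototypical-network step mentioned above can be sketched as follows, with toy 2-D points standing in for encoder embeddings (an assumption; the paper's encoder and episode construction are not detailed here): each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype.

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prototype_classify(support, query):
    """Prototypical-network classification: prototypes are mean support
    embeddings; the query goes to the nearest prototype."""
    prototypes = {label: mean_vec(vecs) for label, vecs in support.items()}
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))

# A few-shot episode: "normal" vs "anomalous", two support examples each.
support = {
    "normal": [(0.0, 0.0), (0.2, 0.1)],
    "anomalous": [(3.0, 3.0), (2.8, 3.2)],
}
print(prototype_classify(support, (0.1, 0.2)))  # → normal
```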
arXiv Detail & Related papers (2025-07-26T17:23:03Z)
- Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding [27.02879006439693]
This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection. By open-sourcing our benchmark toolkit, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
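A common embedding-based recipe (a generic sketch, not necessarily this benchmark's exact pipeline) scores each document by the mean distance to its k nearest neighbours in embedding space; documents far from all others are flagged as anomalies.

```python
import math

def knn_anomaly_scores(embeddings, k=2):
    """Score each embedding by mean Euclidean distance to its k nearest
    neighbours; large scores mark outliers."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    scores = []
    for i, e in enumerate(embeddings):
        ds = sorted(dist(e, o) for j, o in enumerate(embeddings) if j != i)
        scores.append(sum(ds[:k]) / k)
    return scores

# Toy 2-D "document embeddings": one point far from the cluster.
embs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = knn_anomaly_scores(embs)
print(scores.index(max(scores)))  # → 3, the outlier
```

In practice the embeddings would come from an LLM encoder and the detector could equally be an isolation forest or one-class SVM; the k-NN score is just one standard choice.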
arXiv Detail & Related papers (2025-07-16T14:47:41Z)
- TempTest: Local Normalization Distortion and the Detection of Machine-generated Text [0.0]
We introduce a method for detecting machine-generated text that is entirely agnostic of the generating language model. This is achieved by targeting a defect in the way that decoding strategies, such as temperature or top-k sampling, normalize conditional probability measures. We evaluate our detector in white-box and black-box settings across various language models, datasets, and passage lengths.
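The normalization distortion being targeted can be illustrated concretely (this is only a demonstration of the decoding-side effect, not the paper's detection statistic): temperature rescales the logits before the softmax, and top-k truncation redistributes the tail's probability mass onto the surviving tokens, so the sampled distribution no longer matches the model's conditional distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature != 1 already distorts
    the model's conditional distribution."""
    z = [l / temperature for l in logits]
    m = max(z)  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_renormalize(probs, k):
    """Keep the k largest probabilities and renormalize. The truncated
    tail's mass is redistributed onto the survivors: the local
    normalization distortion a detector can target."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]
p = softmax(logits, temperature=0.7)  # temperature-distorted distribution
q = top_k_renormalize(p, k=2)         # top-k additionally truncates the tail
```

Every surviving probability in `q` is strictly larger than its counterpart in `p`, which is the kind of systematic deviation such detectors exploit.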
arXiv Detail & Related papers (2025-03-26T10:56:59Z)
- TextSleuth: Towards Explainable Tampered Text Detection [49.88698441048043]
We propose to explain the basis of tampered text detection in natural language via large multimodal models. To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD. Elaborate queries are introduced to generate high-quality anomaly descriptions with GPT4o. To automatically filter out low-quality annotations, we also propose prompting GPT4o to recognize tampered texts.
arXiv Detail & Related papers (2024-12-19T13:10:03Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation [50.55448707570669]
We propose a novel token-level, reference-free hallucination detection task and an associated annotated dataset named HaDes.
To create this dataset, we first perturb a large number of text segments extracted from English language Wikipedia, and then verify these with crowd-sourced annotations.
arXiv Detail & Related papers (2021-04-18T04:09:48Z)
- MOST: A Multi-Oriented Scene Text Detector with Localization Refinement [67.35280008722255]
We propose a new algorithm for scene text detection, which puts forward a set of strategies to significantly improve the quality of text localization.
Specifically, a Text Feature Alignment Module (TFAM) is proposed to dynamically adjust the receptive fields of features.
A Position-Aware Non-Maximum Suppression (PA-NMS) module is devised to exclude unreliable detections.
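For reference, plain non-maximum suppression, the baseline PA-NMS refines, works as sketched below; the position-aware weighting of detection scores is the paper's refinement and is omitted here.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Plain NMS: greedily keep the highest-scoring box and drop any
    remaining box that overlaps it beyond the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]
```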
arXiv Detail & Related papers (2021-04-02T14:34:41Z)
- Scene Text Detection with Scribble Lines [59.698806258671105]
We propose to annotate texts by scribble lines instead of polygons for text detection.
It is a general labeling method for texts with various shapes and requires low labeling costs.
Experiments show that the proposed method bridges the performance gap between the weakly labeling method and the original polygon-based labeling methods.
arXiv Detail & Related papers (2020-12-09T13:14:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.