A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features
- URL: http://arxiv.org/abs/2511.21744v1
- Date: Sat, 22 Nov 2025 08:08:03 GMT
- Title: A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features
- Authors: Sergey K. Aityan, William Claster, Karthik Sai Emani, Sohni Rais, Thy Tran,
- Abstract summary: We introduce NEULIF, a lightweight approach that achieves best performance in the lightweight detector class.<n>In our approach, a text is first decomposed into stylometric and readability features which are then used for classification by a compact Convolutional Neural Network (CNN) or Random Forest (RF)<n>Models achieve 97% accuracy ( 0.95 F1) for CNN and 95% accuracy ( 0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A growing number of AI-generated texts raise serious concerns. Most existing approaches to AI-generated text detection rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often provide limited generalization across domains. Existing lightweight alternatives achieved significantly lower accuracy on large datasets. We introduce NEULIF, a lightweight approach that achieves best performance in the lightweight detector class, that does not require extensive computational power and provides high detection accuracy. In our approach, a text is first decomposed into stylometric and readability features which are then used for classification by a compact Convolutional Neural Network (CNN) or Random Forest (RF). Evaluated and tested on the Kaggle AI vs. Human corpus, our models achieve 97% accuracy (~ 0.95 F1) for CNN and 95% accuracy (~ 0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~ 25 MB) and Random Forest (~ 10.6 MB) models are orders of magnitude smaller than transformer-based ensembles and can be run efficiently on standard CPU devices, without sacrificing accuracy.This study also highlights the potential of such models for broader applications across languages, domains, and streaming contexts, showing that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.
Related papers
- AI Generated Text Detection [0.0]
This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures.<n>We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage.<n>Results indicate that contextual modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization.
arXiv Detail & Related papers (2026-01-07T11:18:10Z) - Assessing Classical Machine Learning and Transformer-based Approaches for Detecting AI-Generated Research Text [0.0]
Machine learning approaches can distinguish ChatGPT-3.5-generated texts from human-written texts.<n>DistilBERT achieves the overall best performance, while Logistic Regression and BERT-Custom offer solid, balanced alternatives.
arXiv Detail & Related papers (2025-09-20T04:36:21Z) - LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA [4.104443734934105]
We compare encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Alibaba's Qwen2.5-7B/Deep-R1-Distill-Qwen-7B fine-tuned via Low-Rank Adaptation, LoRA), and a FastText baseline.<n>Experiments reveal that although encoder models nearly memorize training data, they suffer significant performance degradation under distribution shifts.
arXiv Detail & Related papers (2025-08-31T07:51:22Z) - PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution [0.0]
PERTINENCE is a novel online method designed to analyze the complexity of input features.<n>It dynamically selects the most suitable model from a pre-trained set to process a given input.<n>It achieves better or comparable accuracy with up to 36% fewer operations.
arXiv Detail & Related papers (2025-07-02T13:22:05Z) - Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection [58.419940585826744]
We introduce FairOPT, an algorithm for group-specific threshold optimization for probabilistic AI-text detectors.<n>We partitioned data into subgroups based on attributes (e.g., text length and writing style) and implemented FairOPT to learn decision thresholds for each group to reduce discrepancy.<n>Our framework paves the way for more robust classification in AI-generated content detection via post-processing.
arXiv Detail & Related papers (2025-02-06T21:58:48Z) - Explainable AI for Comparative Analysis of Intrusion Detection Models [20.683181384051395]
This research analyzes various machine learning models to the tasks of binary and multi-class classification for intrusion detection from network traffic.
We trained all models to the accuracy of 90% on the UNSW-NB15 dataset.
We also discover that Random Forest provides the best performance in terms of accuracy, time efficiency and robustness.
arXiv Detail & Related papers (2024-06-14T03:11:01Z) - Technical Report on the Pangram AI-Generated Text Classifier [0.14732811715354457]
We present Pangram Text, a transformer-based neural network trained to distinguish text written by large language models from text written by humans.
We show that Pangram Text is not biased against nonnative English speakers and generalizes to domains and models unseen during training.
arXiv Detail & Related papers (2024-02-21T17:13:41Z) - Incrementally-Computable Neural Networks: Efficient Inference for
Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z) - Paraphrasing evades detectors of AI-generated text, but retrieval is an
effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z) - A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive
Coding Networks [65.34977803841007]
Predictive coding networks are neuroscience-inspired models with roots in both Bayesian statistics and neuroscience.
We show how by simply changing the temporal scheduling of the update rule for the synaptic weights leads to an algorithm that is much more efficient and stable than the original one.
arXiv Detail & Related papers (2022-11-16T00:11:04Z) - Adaptive Anomaly Detection for Internet of Things in Hierarchical Edge
Computing: A Contextual-Bandit Approach [81.5261621619557]
We propose an adaptive anomaly detection scheme with hierarchical edge computing (HEC)
We first construct multiple anomaly detection DNN models with increasing complexity, and associate each of them to a corresponding HEC layer.
Then, we design an adaptive model selection scheme that is formulated as a contextual-bandit problem and solved by using a reinforcement learning policy network.
arXiv Detail & Related papers (2021-08-09T08:45:47Z) - Contextual-Bandit Anomaly Detection for IoT Data in Distributed
Hierarchical Edge Computing [65.78881372074983]
IoT devices can hardly afford complex deep neural networks (DNN) models, and offloading anomaly detection tasks to the cloud incurs long delay.
We propose and build a demo for an adaptive anomaly detection approach for distributed hierarchical edge computing (HEC) systems.
We show that our proposed approach significantly reduces detection delay without sacrificing accuracy, as compared to offloading detection tasks to the cloud.
arXiv Detail & Related papers (2020-04-15T06:13:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.