AIDetx: a compression-based method for identification of machine-learning generated text
- URL: http://arxiv.org/abs/2411.19869v1
- Date: Fri, 29 Nov 2024 17:31:42 GMT
- Title: AIDetx: a compression-based method for identification of machine-learning generated text
- Authors: Leonardo Almeida, Pedro Rodrigues, Diogo Magalhães, Armando J. Pinho, Diogo Pratas,
- Abstract summary: This paper introduces AIDetx, a novel method for detecting machine-generated text using data compression techniques.
We evaluate AIDetx on two benchmark datasets, achieving F1 scores exceeding 97% and 99%, respectively.
Compared to current methods, such as large language models (LLMs), AIDetx offers a more interpretable and computationally efficient solution.
- Score: 0.8411424745913135
- License:
- Abstract: This paper introduces AIDetx, a novel method for detecting machine-generated text using data compression techniques. Traditional approaches, such as deep learning classifiers, often suffer from high computational costs and limited interpretability. To address these limitations, we propose a compression-based classification framework that leverages finite-context models (FCMs). AIDetx constructs distinct compression models for human-written and AI-generated text, classifying new inputs based on which model achieves a higher compression ratio. We evaluated AIDetx on two benchmark datasets, achieving F1 scores exceeding 97% and 99%, respectively, highlighting its high accuracy. Compared to current methods, such as large language models (LLMs), AIDetx offers a more interpretable and computationally efficient solution, significantly reducing both training time and hardware requirements (e.g., no GPUs needed). The full implementation is publicly available at https://github.com/AIDetx/AIDetx.
Related papers
- Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation [1.0291559330120414]
We propose a low-resource and fast text classification model called LFTC.
Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data.
We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time.
arXiv Detail & Related papers (2024-12-13T07:22:13Z) - In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs)
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z) - Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z) - Embedded Named Entity Recognition using Probing Classifiers [10.573861741540853]
EMBER enables streaming named entity recognition in decoder-only language models without fine-tuning them.
We show that EMBER maintains high token generation rates, with only a negligible decrease in speed of around 1%.
We make our code and data available online, including a toolkit for training, testing, and deploying efficient token classification models.
arXiv Detail & Related papers (2024-03-18T12:58:16Z) - Incremental Online Learning Algorithms Comparison for Gesture and Visual
Smart Sensors [68.8204255655161]
This paper compares four state-of-the-art algorithms in two real applications: gesture recognition based on accelerometer data and image classification.
Our results confirm these systems' reliability and the feasibility of deploying them in tiny-memory MCUs.
arXiv Detail & Related papers (2022-09-01T17:05:20Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Are We Really Making Much Progress in Text Classification? A Comparative Review [5.33235750734179]
We analyze various methods for single-label and multi-label text classification across well-known datasets.
We highlight the superiority of discriminative language models like BERT over generative models for supervised tasks.
arXiv Detail & Related papers (2022-04-08T09:28:20Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - POINTER: Constrained Progressive Text Generation via Insertion-based
Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.