AI-Generated Text Detection and Classification Based on BERT Deep Learning Algorithm
- URL: http://arxiv.org/abs/2405.16422v1
- Date: Sun, 26 May 2024 04:26:07 GMT
- Title: AI-Generated Text Detection and Classification Based on BERT Deep Learning Algorithm
- Authors: Hao Wang, Jianwei Li, Zhengyu Li
- Abstract summary: This study develops an efficient AI-generated text detection model based on the BERT algorithm.
The accuracy increases steadily from the initial 94.78% to 99.72%, while the loss value decreases from 0.261 to 0.021 and converges gradually.
In terms of loss value, the average loss of the training set is 0.0565, while the average loss of the test set is 0.0917, showing a slightly higher loss value.
- Score: 10.5960023194262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI-generated text detection plays an increasingly important role in various fields. In this study, we developed an efficient AI-generated text detection model based on the BERT algorithm, which provides new ideas and methods for solving related problems. In the data preprocessing stage, a series of steps were taken to process the text, including operations such as converting to lowercase, word splitting, removing stop words, stemming extraction, removing digits, and eliminating redundant spaces, to ensure data quality and accuracy. By dividing the dataset into a training set and a test set in the ratio of 60% and 40%, and observing the changes in the accuracy and loss values during the training process, we found that the model performed well during the training process. The accuracy increases steadily from the initial 94.78% to 99.72%, while the loss value decreases from 0.261 to 0.021 and converges gradually, which indicates that the BERT model is able to detect AI-generated text with high accuracy and the prediction results are gradually approaching the real classification results. Further analysis of the results of the training and test sets reveals that in terms of loss value, the average loss of the training set is 0.0565, while the average loss of the test set is 0.0917, showing a slightly higher loss value. As for the accuracy, the average accuracy of the training set reaches 98.1%, while the average accuracy of the test set is 97.71%, which is not much different from each other, indicating that the model has good generalisation ability. In conclusion, the AI-generated text detection model based on the BERT algorithm proposed in this study shows high accuracy and stability in experiments, providing an effective solution for related fields.
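The preprocessing steps and the 60%/40% split described in the abstract could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the stop-word list, the `naive_stem` suffix stripper, and the function names `preprocess` and `split_60_40` are all assumptions made for the example, and the BERT fine-tuning step itself is not shown.

```python
import random
import re

# Tiny illustrative stop-word list (assumption; the paper does not name one).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "this"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the cleaning steps listed in the abstract, in order."""
    text = text.lower()                        # convert to lowercase
    text = re.sub(r"\d+", " ", text)           # remove digits
    tokens = re.findall(r"[a-z]+", text)       # word splitting
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    tokens = [naive_stem(t) for t in tokens]   # stemming
    return " ".join(tokens)                    # single spaces, no redundant whitespace


def split_60_40(samples, seed=0):
    """Shuffle and split samples into 60% training and 40% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 6 // 10              # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    print(preprocess("The  model is DETECTING 42 generated texts"))
    train, test = split_60_40(range(10))
    print(len(train), len(test))
```

In practice a library stemmer and stop-word list would replace the toy versions here, and the cleaned text would then be passed to a BERT tokenizer before training.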
Related papers
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z)
- Text Sentiment Analysis and Classification Based on Bidirectional Gated Recurrent Units (GRUs) Model [6.096738978232722]
This paper explores the importance of text sentiment analysis and classification in the field of natural language processing.
It proposes a new approach to sentiment analysis and classification based on the bidirectional gated recurrent units (GRUs) model.
arXiv Detail & Related papers (2024-04-26T02:40:03Z)
- Large Language Model (LLM) AI text generation detection based on transformer deep learning algorithm [0.9004420912552793]
A tool for detecting AI-generated text is developed based on the Transformer model.
The deep learning model combines layers such as LSTM, Transformer, and CNN for text classification or sequence labelling tasks.
The model has 99% prediction accuracy for AI-generated text, with a precision of 0.99, a recall of 1, and an F1 score of 0.99.
arXiv Detail & Related papers (2024-04-06T06:22:45Z)
- Efficient human-in-loop deep learning model training with iterative refinement and statistical result validation [0.0]
We demonstrate a method for creating segmentations, a necessary part of data cleaning for ultrasound imaging machine learning pipelines.
We propose a four-step method to leverage automatically generated training data and fast human visual checks to improve model accuracy while keeping the time/effort and cost low.
The method is demonstrated on a cardiac ultrasound segmentation task, removing background data, including static PHI.
arXiv Detail & Related papers (2023-04-03T13:56:01Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles [38.23896575179384]
We propose a principled and practically effective framework that simultaneously addresses the two tasks.
On iWildCam, one instantiation reduces the estimation error for unsupervised accuracy estimation by at least 70% and improves the F1 score for error detection by at least 4.7%.
arXiv Detail & Related papers (2021-06-29T21:32:51Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based ALs are fairer in their decisions with respect to a protected class.
We also explore the interaction of algorithmic fairness methods such as gradient reversal (GRAD) and BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- Unsupervised neural adaptation model based on optimal transport for spoken language identification [54.96267179988487]
Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded.
We propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID.
arXiv Detail & Related papers (2020-12-24T07:37:19Z)
- TAVAT: Token-Aware Virtual Adversarial Training for Language Understanding [55.16953347580948]
Gradient-based adversarial training is widely used in improving the robustness of neural networks.
It cannot be easily adapted to natural language processing tasks since the embedding space is discrete.
We propose a Token-Aware Virtual Adversarial Training method to craft fine-grained perturbations.
arXiv Detail & Related papers (2020-04-30T02:03:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.