Distinguishing Human Generated Text From ChatGPT Generated Text Using
Machine Learning
- URL: http://arxiv.org/abs/2306.01761v1
- Date: Fri, 26 May 2023 09:27:43 GMT
- Title: Distinguishing Human Generated Text From ChatGPT Generated Text Using
Machine Learning
- Authors: Niful Islam, Debopom Sutradhar, Humaira Noor, Jarin Tasnim Raya,
Monowara Tabassum Maisha, Dewan Md Farid
- Abstract summary: This paper presents a machine learning-based solution that can distinguish ChatGPT-generated text from human-written text.
We have tested the proposed model on a Kaggle dataset consisting of 10,000 texts, of which 5,204 were written by humans and collected from news and social media.
On the corpus generated by GPT-3.5, the proposed algorithm achieves an accuracy of 77%.
- Score: 0.251657752676152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ChatGPT is a conversational artificial intelligence that belongs to
the generative pre-trained transformer (GPT) family of large language models.
This text-generation model was fine-tuned with both supervised learning and
reinforcement learning so that it can produce text that appears to be written
by a human. Although this generative model has numerous advantages, it also
raises some reasonable concerns. This paper presents a machine learning-based
solution that can distinguish ChatGPT-generated text from human-written text,
along with a comparative analysis of 11 machine learning and deep learning
algorithms in the classification process. We have tested the proposed model on
a Kaggle dataset consisting of 10,000 texts, of which 5,204 were written by
humans and collected from news and social media. On the corpus generated by
GPT-3.5, the proposed algorithm achieves an accuracy of 77%.
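The abstract does not name the 11 algorithms that were compared, so the following is only a minimal sketch of the general task: a binary classifier that labels a text as human-written or ChatGPT-generated and is scored by accuracy. It assumes a bag-of-words multinomial naive Bayes model, which is not necessarily one of the paper's models. The class name, the helper functions, and the four toy training texts are all hypothetical and are not taken from the Kaggle dataset.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical preprocessing: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + self.alpha * len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # words unseen in training are skipped
                    score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


def accuracy(model, texts, labels):
    # Fraction of texts whose predicted label matches the true label.
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy data invented for this sketch; not from the paper's dataset.
train_texts = [
    "omg cant believe the game last night lol",
    "traffic was awful again this morning ugh",
    "as an ai language model i can provide a summary",
    "in conclusion there are several key factors to consider",
]
train_labels = ["human", "human", "chatgpt", "chatgpt"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("lol the traffic last night"))
print(accuracy(model, train_texts, train_labels))
```

Accuracy measured on the training texts, as in the last line, says nothing about generalization. A realistic evaluation would hold out a separate test split, which is presumably how a figure such as the reported 77% would be obtained.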
Related papers
- Enhancing Text Generation in Joint NLG/NLU Learning Through Curriculum Learning, Semi-Supervised Training, and Advanced Optimization Techniques [0.0]
This research paper developed a novel approach to improve text generation in the context of joint Natural Language Generation (NLG) and Natural Language Understanding (NLU) learning.
The data is prepared by gathering and preprocessing annotated datasets, including cleaning, tokenization, stemming, and stop-word removal.
The approach uses Transformer-based encoders and decoders to capture long-range dependencies and improve source-target sequence modelling.
Reinforcement learning with policy gradient techniques, semi-supervised training, improved attention mechanisms, and differentiable approximations are employed to fine-tune the models and handle complex linguistic tasks effectively.
arXiv Detail & Related papers (2024-10-17T12:43:49Z)
- Distinguishing Chatbot from Human [1.1249583407496218]
We develop a new dataset consisting of more than 750,000 human-written paragraphs.
Based on this dataset, we apply Machine Learning (ML) techniques to determine the origin of text.
Our proposed solutions offer high classification accuracy and serve as useful tools for textual analysis.
arXiv Detail & Related papers (2024-08-03T13:18:04Z)
- Technical Report on the Pangram AI-Generated Text Classifier [0.14732811715354457]
We present Pangram Text, a transformer-based neural network trained to distinguish text written by large language models from text written by humans.
We show that Pangram Text is not biased against nonnative English speakers and generalizes to domains and models unseen during training.
arXiv Detail & Related papers (2024-02-21T17:13:41Z)
- Generative AI Text Classification using Ensemble LLM Approaches [0.12483023446237698]
Large Language Models (LLMs) have shown impressive performance across a variety of AI and natural language processing tasks.
We propose an ensemble neural model that generates probabilities from different pre-trained LLMs.
For the first task of distinguishing between AI and human generated text, our model ranked in fifth and thirteenth place.
arXiv Detail & Related papers (2023-09-14T14:41:46Z)
- Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text [48.36706154871577]
We introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts).
It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts.
We also propose the "Polish Ratio" method, an innovative measure of the degree of modification made by ChatGPT compared to the original human-written text.
arXiv Detail & Related papers (2023-07-21T06:38:37Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has an AUC of 0.45.
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content [27.901155229342375]
We present a novel approach for detecting ChatGPT-generated vs. human-written text using language models.
Our models achieved remarkable results, with an accuracy of over 97% on the test dataset, as evaluated through various metrics.
arXiv Detail & Related papers (2023-05-13T17:12:11Z)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
- SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes [93.19166902594168]
We propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation.
The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus.
We evaluate SESCORE2 and previous methods on four text generation tasks across three languages.
arXiv Detail & Related papers (2022-12-19T09:02:16Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Robust Conversational AI with Grounded Text Generation [77.56950706340767]
GTG is a hybrid model which uses a large-scale Transformer neural network as its backbone.
It generates responses grounded in dialog belief state and real-world knowledge for task completion.
arXiv Detail & Related papers (2020-09-07T23:49:28Z)