BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets
- URL: http://arxiv.org/abs/2101.09345v1
- Date: Fri, 22 Jan 2021 21:50:38 GMT
- Title: BERT Transformer model for Detecting Arabic GPT2 Auto-Generated Tweets
- Authors: Fouzi Harrag, Maria Debbah, Kareem Darwish, Ahmed Abdelali
- Abstract summary: We propose a transfer learning based model that detects whether an Arabic sentence was written by a human or automatically generated by a bot.
Our new transfer-learning model achieves an accuracy of up to 98%.
To the best of our knowledge, this work is the first study in which AraBERT and GPT-2 were combined to detect and classify Arabic auto-generated texts.
- Score: 6.18447297698017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During the last two decades, we have progressively turned to the Internet and social media to find news, hold conversations, and share opinions. Recently, OpenAI developed a machine learning system called GPT-2 (Generative Pre-trained Transformer 2), which can produce deepfake texts. It can generate blocks of text from brief writing prompts that look as if they were written by humans, facilitating the spread of false or auto-generated text. In line with this progress, and in order to counteract potential dangers, several methods have been proposed for detecting text written by these language models. In this paper, we propose a transfer learning based model that detects whether an Arabic sentence was written by a human or automatically generated by a bot. Our dataset is based on tweets from a previous work, which we crawled and extended using the Twitter API. We used GPT2-Small-Arabic to generate fake Arabic sentences. For evaluation, we compared several recurrent neural network (RNN) word-embedding-based baseline models, namely LSTM, BI-LSTM, GRU, and BI-GRU, with a transformer-based model. Our new transfer-learning model achieves an accuracy of up to 98%. To the best of our knowledge, this work is the first study in which AraBERT and GPT-2 were combined to detect and classify Arabic auto-generated texts.
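To make the pipeline concrete, here is a minimal sketch of the generation step. It assumes the publicly available akhooli/gpt2-small-arabic checkpoint on Hugging Face as a stand-in for the GPT2-Small-Arabic model named in the abstract; the prompts and sampling settings are illustrative, not the paper's exact configuration.

```python
# Sketch: generate synthetic Arabic tweets from brief prompts with a small
# Arabic GPT-2. The checkpoint name is an assumption, not confirmed by the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="akhooli/gpt2-small-arabic")

# Short tweet-like prompts (illustrative); the paper seeds generation with
# brief writing prompts.
prompts = ["الطقس اليوم في الجزائر", "آخر أخبار كرة القدم"]
for prompt in prompts:
    samples = generator(prompt, max_length=40, do_sample=True,
                        top_k=50, num_return_sequences=2)
    for s in samples:
        print(s["generated_text"])
```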
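The RNN baselines the abstract lists (LSTM, BI-LSTM, GRU, BI-GRU) all follow the same pattern: an embedding layer, a recurrent encoder over the word embeddings, and a small classification head. A minimal PyTorch sketch of the BI-LSTM variant, with assumed layer sizes:

```python
# Sketch of a Bi-LSTM word-embedding baseline for human-vs-bot tweet
# classification; vocabulary size and hidden sizes are assumptions.
import torch
import torch.nn as nn

class BiLSTMBaseline(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 2)  # human vs. auto-generated

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.head(h[:, -1, :])  # logits from the final time step

model = BiLSTMBaseline()
logits = model(torch.randint(0, 20000, (4, 30)))  # batch of 4 toy sequences
print(logits.shape)  # torch.Size([4, 2])
```

Swapping nn.LSTM for nn.GRU gives the BI-GRU baseline, and setting bidirectional=False recovers the unidirectional variants.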
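On the transfer-learning side, a minimal sketch of fine-tuning AraBERT as a binary sequence classifier over such tweets follows. The checkpoint name, hyperparameters, and two-example "dataset" are illustrative assumptions, not the paper's reported setup.

```python
# Sketch: fine-tune AraBERT to label tweets as human-written (0) or
# auto-generated (1). Checkpoint and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "aubmindlab/bert-base-arabert"  # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy labeled pairs; in the paper, positives come from GPT2-Small-Arabic
# generations and negatives from crawled human tweets.
texts = ["تغريدة كتبها إنسان", "نص مولد آليا"]
labels = torch.tensor([0, 1])

enc = tokenizer(texts, padding=True, truncation=True,
                max_length=64, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few epochs
    for input_ids, attention_mask, y in loader:
        loss = model(input_ids=input_ids,
                     attention_mask=attention_mask, labels=y).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Inference: probability that a new tweet is auto-generated.
model.eval()
with torch.no_grad():
    probe = tokenizer("تغريدة جديدة للفحص", return_tensors="pt")
    print(model(**probe).logits.softmax(-1)[0, 1].item())
```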
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; Arabic, however, still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode the position and texts at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 when detecting ChatGPT generations, whereas GPT-J-6B, a larger model from the GPT family, has an AUC of only 0.45.
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model widely used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Are You Robert or RoBERTa? Deceiving Online Authorship Attribution Models Using Neural Text Generators [3.9533044769534444]
GPT-2 and XLM language models are used to generate texts using existing posts of online users.
We then examine whether these AI-based text generators are capable of mimicking authorial style to such a degree that they can deceive typical AA models.
Our findings highlight the current capacity of powerful natural language models to generate original online posts capable of mimicking authorial style.
arXiv Detail & Related papers (2022-03-18T09:19:14Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Offensive Language and Hate Speech Detection with Deep Learning and Transfer Learning [1.77356577919977]
We propose an approach to automatically classify tweets into three classes: Hate, Offensive, and Neither.
We create a class module that contains the main functionality, including text classification, sentiment checking, and text data augmentation.
arXiv Detail & Related papers (2021-08-06T20:59:47Z)
- AraGPT2: Pre-Trained Transformer for Arabic Language Generation [0.0]
We develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles.
Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available.
For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles.
arXiv Detail & Related papers (2020-12-31T09:48:05Z)
- Machine Generation and Detection of Arabic Manipulated and Fake News [8.014703200985084]
We present a novel method for automatically generating Arabic manipulated (and potentially fake) news stories.
Our method is simple and depends only on the availability of true stories, which are abundant online, and a part-of-speech (POS) tagger.
We carry out a human annotation study that casts light on the effects of machine manipulation on text veracity.
We develop the first models for detecting manipulated Arabic news and achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-05T20:50:22Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.