Sentiment analysis and random forest to classify LLM versus human source applied to Scientific Texts
- URL: http://arxiv.org/abs/2404.08673v1
- Date: Fri, 5 Apr 2024 16:14:36 GMT
- Title: Sentiment analysis and random forest to classify LLM versus human source applied to Scientific Texts
- Authors: Javier J. Sanchez-Medina
- Abstract summary: A new methodology is proposed to classify texts as coming from an automatic text production engine or from a human.
Using four different sentiment lexicons, a number of new features were produced and then fed to a random forest machine learning pipeline to train such a model.
Results suggest that this may be a promising research line for detecting fraud in environments where humans are supposed to be the source of texts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: After the launch of ChatGPT v.4, there has been a vivid global discussion on the ability of this artificial-intelligence-powered platform, and other similar ones, to automatically produce all kinds of texts, including scientific and technical texts. This has triggered a reflection in many institutions on whether education and academic procedures should be adapted to the fact that, in the future, many of the texts we read will not be written by humans (students, scholars, etc.), at least not entirely. In this work, a new methodology is proposed to classify texts as coming from an automatic text production engine or from a human, using Sentiment Analysis as a source of feature-engineered independent variables on which a Random Forest classification algorithm is trained. Using four different sentiment lexicons, a number of new features were produced and then fed to a random forest machine learning pipeline to train such a model. Results suggest that this may be a promising research line for detecting fraud in environments where humans are supposed to be the source of texts.
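The pipeline described in the abstract, lexicon-based sentiment scores used as engineered features for a Random Forest classifier, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the four lexicons, so a tiny hypothetical lexicon stands in, and the labels and texts are toy data.

```python
# Hedged sketch: sentiment-lexicon features fed to a random forest.
# The toy lexicon and example texts are illustrative only; the paper's
# four sentiment lexicons and corpus are not specified in the abstract.
from sklearn.ensemble import RandomForestClassifier

def lexicon_features(text, lexicon):
    """Score each token against a {word: polarity} lexicon and aggregate
    the scores into simple per-text features (pos rate, neg rate, net)."""
    tokens = text.lower().split()
    scores = [lexicon.get(t, 0.0) for t in tokens]
    pos = sum(s for s in scores if s > 0)
    neg = -sum(s for s in scores if s < 0)
    n = max(len(tokens), 1)
    return [pos / n, neg / n, (pos - neg) / n]

# Hypothetical mini-lexicon standing in for the four lexicons of the paper.
toy_lexicon = {"promising": 1.0, "convincing": 0.8, "fraud": -1.0, "poor": -0.7}

texts = ["a promising and convincing result", "poor work with possible fraud"]
labels = [0, 1]  # 0 = human-written, 1 = LLM-generated (toy labels)

X = [lexicon_features(t, toy_lexicon) for t in texts]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
pred = clf.predict(X)
```

In the actual methodology, each of the four lexicons would contribute its own block of features, and the model would be evaluated on held-out texts rather than the training set shown here.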
Related papers
- Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts.
We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z)
- Distinguishing Chatbot from Human [1.1249583407496218]
We develop a new dataset consisting of more than 750,000 human-written paragraphs.
Based on this dataset, we apply Machine Learning (ML) techniques to determine the origin of text.
Our proposed solutions offer high classification accuracy and serve as useful tools for textual analysis.
arXiv Detail & Related papers (2024-08-03T13:18:04Z)
- Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey [97.33926242130732]
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses.
Despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs.
To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text.
arXiv Detail & Related papers (2023-10-23T18:11:32Z)
- The Imitation Game: Detecting Human and AI-Generated Texts in the Era of ChatGPT and BARD [3.2228025627337864]
We introduce a novel dataset of human-written and AI-generated texts in different genres.
We employ several machine learning models to classify the texts.
Results demonstrate the efficacy of these models in discerning between human and AI-generated text.
arXiv Detail & Related papers (2023-07-22T21:00:14Z)
- Distinguishing Human Generated Text From ChatGPT Generated Text Using Machine Learning [0.251657752676152]
This paper presents a machine learning-based solution that can identify the ChatGPT delivered text from the human written text.
We have tested the proposed model on a Kaggle dataset consisting of 10,000 texts out of which 5,204 texts were written by humans and collected from news and social media.
On a corpus generated by GPT-3.5, the proposed algorithm achieves an accuracy of 77%.
arXiv Detail & Related papers (2023-05-26T09:27:43Z)
- MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% out-of-domain texts generated by a new LLM, indicating the feasibility for application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
- Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text [23.622347443796183]
We study a more realistic setting where text begins as human-written and transitions to being generated by state-of-the-art neural language models.
We show that, while annotators often struggle at this task, there is substantial variance in annotator skill and that given proper incentives, annotators can improve at this task over time.
arXiv Detail & Related papers (2022-12-24T06:40:25Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
- A Survey of Knowledge-Enhanced Text Generation [81.24633231919137]
The goal of text generation is to make machines express in human language.
Various neural encoder-decoder models have been proposed to achieve the goal by learning to map input text to output text.
To address the limitation that the input text alone often provides insufficient knowledge, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models.
arXiv Detail & Related papers (2020-10-09T06:46:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.