RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic
Features for Distinguishing AI-Generated and Human-Written Texts
- URL: http://arxiv.org/abs/2402.14838v1
- Date: Mon, 19 Feb 2024 00:40:17 GMT
- Title: RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic
Features for Distinguishing AI-Generated and Human-Written Texts
- Authors: Mohammad Heydari Rad, Farhan Farsi, Shayan Bali, Romina Etezadi,
Mehrnoush Shamsfard
- Abstract summary: This article investigates the problem of AI-generated text detection from two different aspects: semantics and syntax.
We present an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks.
- Score: 0.8437187555622164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, the usage of Large Language Models (LLMs) has increased, and LLMs
have been used to generate texts in different languages and for different
tasks. Additionally, due to the participation of remarkable companies such as
Google and OpenAI, LLMs are now more accessible, and people can easily use
them. However, an important issue is how we can detect AI-generated texts from
human-written ones. In this article, we have investigated the problem of
AI-generated text detection from two different aspects: semantics and syntax.
Finally, we presented an AI model that can distinguish AI-generated texts from
human-written ones with high accuracy on both multilingual and monolingual
tasks using the M4 dataset. According to our results, using a semantic approach
would be more helpful for detection. However, there is a lot of room for
improvement in the syntactic approach, and it would be a good approach for
future work.
Related papers
- GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated text detection by proposing the GigaCheck.
Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts.
Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize artificially generated intervals within text.
arXiv Detail & Related papers (2024-10-31T08:30:55Z) - DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning [24.99797253885887]
We argue that the key to accomplishing this task lies in distinguishing writing styles of different authors.
We propose DeTeCtive, a multi-task auxiliary, multi-level contrastive learning framework.
Our method is compatible with a range of text encoders.
arXiv Detail & Related papers (2024-10-28T12:34:49Z) - Detecting Machine-Generated Long-Form Content with Latent-Space Variables [54.07946647012579]
Existing zero-shot detectors primarily focus on token-level distributions, which are vulnerable to real-world domain shifts.
We propose a more robust method that incorporates abstract elements, such as event transitions, as key deciding factors to detect machine versus human texts.
arXiv Detail & Related papers (2024-10-04T18:42:09Z) - Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD)
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z) - HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis [14.467821652366574]
We introduce the largest benchmark for spoken texts - HANSEN (Human ANd ai Spoken tExt beNchmark)
HANSEN encompasses meticulous curation of existing speech datasets accompanied by transcripts, alongside the creation of novel AI-generated spoken text datasets.
To evaluate and demonstrate the utility of HANSEN, we perform Authorship (AA) & Author Verification (AV) on human-spoken datasets and conducted Human vs. AI spoken text detection using state-of-the-art (SOTA) models.
arXiv Detail & Related papers (2023-10-25T16:23:17Z) - Towards Possibilities & Impossibilities of AI-generated Text Detection:
A Survey [97.33926242130732]
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses.
Despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs.
To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text.
arXiv Detail & Related papers (2023-10-23T18:11:32Z) - Generative AI Text Classification using Ensemble LLM Approaches [0.12483023446237698]
Large Language Models (LLMs) have shown impressive performance across a variety of AI and natural language processing tasks.
We propose an ensemble neural model that generates probabilities from different pre-trained LLMs.
For the first task of distinguishing between AI and human generated text, our model ranked in fifth and thirteenth place.
arXiv Detail & Related papers (2023-09-14T14:41:46Z) - The Imitation Game: Detecting Human and AI-Generated Texts in the Era of
ChatGPT and BARD [3.2228025627337864]
We introduce a novel dataset of human-written and AI-generated texts in different genres.
We employ several machine learning models to classify the texts.
Results demonstrate the efficacy of these models in discerning between human and AI-generated text.
arXiv Detail & Related papers (2023-07-22T21:00:14Z) - M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box
Machine-Generated Text Detection [69.29017069438228]
Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries.
This has also raised concerns about the potential misuse of such texts in journalism, education, and academia.
In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse.
arXiv Detail & Related papers (2023-05-24T08:55:11Z) - MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% out-of-domain texts generated by a new LLM, indicating the feasibility for application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.