Paraphrase Identification with Deep Learning: A Review of Datasets and
Methods
- URL: http://arxiv.org/abs/2212.06933v1
- Date: Tue, 13 Dec 2022 23:06:20 GMT
- Authors: Chao Zhou (Department of Computer Science, Syracuse University), Cheng
Qiu (School of Arts and Science, Vanderbilt University), Daniel E. Acuna
(Department of Computer Science, University of Colorado at Boulder)
- Abstract summary: Text generation tools like GPT-3 and ChatGPT can pose a serious threat to the credibility of various forms of media.
Detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of AI technology has made text generation tools like
GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can
pose a serious threat to the credibility of various forms of media, including
scientific literature and news sources, if these technologies are used for
plagiarism. Despite the development of automated methods for paraphrase
identification, detecting this type of plagiarism remains a challenge due to
the disparate nature of the datasets on which these methods are trained. In
this study, we review traditional and current approaches to paraphrase
identification and propose a refined typology of paraphrases. We also
investigate how this typology is represented in popular datasets and how
under-representation of certain types of paraphrases impacts detection
capabilities. Finally, we outline new directions for future research and
datasets in the pursuit of more effective paraphrase detection using AI.
Related papers
- Deepfake tweets automatic detection [0.0]
This study uses advanced natural language processing (NLP) techniques to distinguish between genuine and AI-generated texts.
By developing reliable methods for detecting AI-generated misinformation, this work contributes to a more trustworthy online information environment.
arXiv Detail & Related papers (2024-06-24T09:55:31Z)
- Detecting AI-Generated Text: Factors Influencing Detectability with Current Methods [13.14749943120523]
Knowing whether a text was produced by a human or by artificial intelligence (AI) is important for determining its trustworthiness.
State-of-the-art approaches to AIGT detection include watermarking, statistical and stylistic analysis, and machine learning classification.
We aim to provide insight into the salient factors that combine to determine how "detectable" AIGT text is under different scenarios.
arXiv Detail & Related papers (2024-06-21T18:31:49Z)
- Who Writes the Review, Human or AI? [0.36498648388765503]
This study proposes a methodology to accurately distinguish AI-generated and human-written book reviews.
Our approach utilizes transfer learning, enabling the model to identify generated text across different topics.
The experimental results demonstrate that it is feasible to detect the original source of text, achieving an accuracy rate of 96.86%.
arXiv Detail & Related papers (2024-05-30T17:38:44Z)
- Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD).
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z)
- Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey [97.33926242130732]
Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses.
Despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs.
To address these concerns, a consensus among the research community is to develop algorithmic solutions to detect AI-generated text.
arXiv Detail & Related papers (2023-10-23T18:11:32Z)
- Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy [52.765898203824975]
We introduce a semantic-aware watermarking algorithm that considers the characteristics of conditional text generation and the input context.
Experimental results demonstrate that our proposed method yields substantial improvements across various text generation models.
arXiv Detail & Related papers (2023-07-25T20:24:22Z)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
- A survey on text generation using generative adversarial networks [0.0]
This work presents a thorough review concerning recent studies and text generation advancements using Generative Adversarial Networks.
The usage of adversarial learning for text generation is promising as it provides alternatives to generate the so-called "natural" language.
arXiv Detail & Related papers (2022-12-20T17:54:08Z)
- SynSciPass: detecting appropriate uses of scientific text generation [0.0]
We develop a framework for dataset development that provides a nuanced approach to detecting machine generated text.
By training the same model that performed well on DAGPap22 on SynSciPass, we show that not only is the model more robust to domain shifts but also is able to uncover the type of technology used for machine generated text.
arXiv Detail & Related papers (2022-09-07T13:16:40Z)
- Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data [48.7576911714538]
We discuss how these techniques can be used to detect political content across different platforms.
We compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks.
Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models.
arXiv Detail & Related papers (2022-07-01T15:23:23Z)
- Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)