Paraphrase Identification with Deep Learning: A Review of Datasets and
Methods
- URL: http://arxiv.org/abs/2212.06933v1
- Date: Tue, 13 Dec 2022 23:06:20 GMT
- Title: Paraphrase Identification with Deep Learning: A Review of Datasets and
Methods
- Authors: Chao Zhou (Department of Computer Science, Syracuse University), Cheng
Qiu (School of Arts and Science, Vanderbilt University), Daniel E. Acuna
(Department of Computer Science, University of Colorado at Boulder)
- Abstract summary: Text generation tools like GPT-3 and ChatGPT can pose serious threat to the credibility of various forms of media.
detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of AI technology has made text generation tools like
GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can
pose serious threat to the credibility of various forms of media if these
technologies are used for plagiarism, including scientific literature and news
sources. Despite the development of automated methods for paraphrase
identification, detecting this type of plagiarism remains a challenge due to
the disparate nature of the datasets on which these methods are trained. In
this study, we review traditional and current approaches to paraphrase
identification and propose a refined typology of paraphrases. We also
investigate how this typology is represented in popular datasets and how
under-representation of certain types of paraphrases impacts detection
capabilities. Finally, we outline new directions for future research and
datasets in the pursuit of more effective paraphrase detection using AI.
Related papers
- Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing [0.0]
It is crucial to design robust plagiarism detection systems tailored for low-resource languages.
This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts.
arXiv Detail & Related papers (2025-01-09T14:14:18Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Analysis of Plan-based Retrieval for Grounded Text Generation [78.89478272104739]
hallucinations occur when a language model is given a generation task outside its parametric knowledge.
A common strategy to address this limitation is to infuse the language models with retrieval mechanisms.
We analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations.
arXiv Detail & Related papers (2024-08-20T02:19:35Z) - Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD)
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z) - Paraphrase Types for Generation and Detection [7.800428507692341]
We name these tasks Paraphrase Type Generation and Paraphrase Type Detection.
Our results suggest that while current techniques perform well in a binary classification scenario, the inclusion of fine-grained paraphrase types poses a significant challenge.
We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.
arXiv Detail & Related papers (2023-10-23T12:32:41Z) - Paraphrasing evades detectors of AI-generated text, but retrieval is an
effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z) - SynSciPass: detecting appropriate uses of scientific text generation [0.0]
We develop a framework for dataset development that provides a nuanced approach to detecting machine generated text.
By training the same model that performed well on DAGPap22 on SynSciPass, we show that not only is the model more robust to domain shifts but also is able to uncover the type of technology used for machine generated text.
arXiv Detail & Related papers (2022-09-07T13:16:40Z) - Representation Learning for Resource-Constrained Keyphrase Generation [78.02577815973764]
We introduce salient span recovery and salient span prediction as guided denoising language modeling objectives.
We show the effectiveness of the proposed approach for low-resource and zero-shot keyphrase generation.
arXiv Detail & Related papers (2022-03-15T17:48:04Z) - A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks.
It firstly highlights the generic paradigm of retrieval-augmented generation, and then it reviews notable approaches according to different tasks.
arXiv Detail & Related papers (2022-02-02T16:18:41Z) - Data Expansion using Back Translation and Paraphrasing for Hate Speech
Detection [1.192436948211501]
We present a new deep learning-based method that fuses a Back Translation method, and a Paraphrasing technique for data augmentation.
We evaluate our proposal on five publicly available datasets; namely, AskFm corpus, Formspring dataset, Warner and Waseem dataset, Olid, and Wikipedia toxic comments dataset.
arXiv Detail & Related papers (2021-05-25T09:52:42Z) - SmartPatch: Improving Handwritten Word Imitation with Patch
Discriminators [67.54204685189255]
We propose SmartPatch, a new technique increasing the performance of current state-of-the-art methods.
We combine the well-known patch loss with information gathered from the parallel trained handwritten text recognition system.
This leads to a more enhanced local discriminator and results in more realistic and higher-quality generated handwritten words.
arXiv Detail & Related papers (2021-05-21T18:34:21Z) - Are Neural Language Models Good Plagiarists? A Benchmark for Neural
Paraphrase Detection [5.847824494580938]
We propose a benchmark consisting of paraphrased articles using recent language models relying on the Transformer architecture.
Our contribution fosters future research of paraphrase detection systems as it offers a large collection of aligned original and paraphrased documents.
arXiv Detail & Related papers (2021-03-23T11:01:35Z) - MICE: Mining Idioms with Contextual Embeddings [0.0]
MICEatic expressions can be problematic for natural language processing applications.
We present an approach that uses contextual embeddings for that purpose.
We show that deep neural networks using both embeddings perform much better than existing approaches.
arXiv Detail & Related papers (2020-08-13T08:56:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.