Paraphrase Identification with Deep Learning: A Review of Datasets and Methods
- URL: http://arxiv.org/abs/2212.06933v3
- Date: Tue, 08 Oct 2024 03:29:14 GMT
- Title: Paraphrase Identification with Deep Learning: A Review of Datasets and Methods
- Authors: Chao Zhou, Cheng Qiu, Lizhen Liang, Daniel E. Acuna,
- Abstract summary: We investigate how the under-representation of certain paraphrase types in popular datasets affects the ability to detect plagiarism.
We introduce and validate a new refined typology for paraphrases.
We propose new directions for future research and dataset development to enhance AI-based paraphrase detection.
- Score: 1.4325734372991794
- License:
- Abstract: The rapid progress of Natural Language Processing (NLP) technologies has led to the widespread availability and effectiveness of text generation tools such as ChatGPT and Claude. While highly useful, these technologies also pose significant risks to the credibility of various media forms if they are employed for paraphrased plagiarism -- one of the most subtle forms of content misuse in scientific literature and general text media. Although automated methods for paraphrase identification have been developed, detecting this type of plagiarism remains challenging due to the inconsistent nature of the datasets used to train these methods. In this article, we examine traditional and contemporary approaches to paraphrase identification, investigating how the under-representation of certain paraphrase types in popular datasets, including those used to train Large Language Models (LLMs), affects the ability to detect plagiarism. We introduce and validate a new refined typology for paraphrases (ReParaphrased, REfined PARAPHRASE typology definitions) to better understand the disparities in paraphrase type representation. Lastly, we propose new directions for future research and dataset development to enhance AI-based paraphrase detection.
Related papers
- Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing [0.0]
It is crucial to design robust plagiarism detection systems tailored for low-resource languages.
This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts.
arXiv Detail & Related papers (2025-01-09T14:14:18Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Analysis of Plan-based Retrieval for Grounded Text Generation [78.89478272104739]
hallucinations occur when a language model is given a generation task outside its parametric knowledge.
A common strategy to address this limitation is to infuse the language models with retrieval mechanisms.
We analyze how planning can be used to guide retrieval to further reduce the frequency of hallucinations.
arXiv Detail & Related papers (2024-08-20T02:19:35Z) - Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD)
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z) - Paraphrase Types for Generation and Detection [7.800428507692341]
We name these tasks Paraphrase Type Generation and Paraphrase Type Detection.
Our results suggest that while current techniques perform well in a binary classification scenario, the inclusion of fine-grained paraphrase types poses a significant challenge.
We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.
arXiv Detail & Related papers (2023-10-23T12:32:41Z) - Representation Learning for Resource-Constrained Keyphrase Generation [78.02577815973764]
We introduce salient span recovery and salient span prediction as guided denoising language modeling objectives.
We show the effectiveness of the proposed approach for low-resource and zero-shot keyphrase generation.
arXiv Detail & Related papers (2022-03-15T17:48:04Z) - A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks.
It firstly highlights the generic paradigm of retrieval-augmented generation, and then it reviews notable approaches according to different tasks.
arXiv Detail & Related papers (2022-02-02T16:18:41Z) - Data Expansion using Back Translation and Paraphrasing for Hate Speech
Detection [1.192436948211501]
We present a new deep learning-based method that fuses a Back Translation method, and a Paraphrasing technique for data augmentation.
We evaluate our proposal on five publicly available datasets; namely, AskFm corpus, Formspring dataset, Warner and Waseem dataset, Olid, and Wikipedia toxic comments dataset.
arXiv Detail & Related papers (2021-05-25T09:52:42Z) - SmartPatch: Improving Handwritten Word Imitation with Patch
Discriminators [67.54204685189255]
We propose SmartPatch, a new technique increasing the performance of current state-of-the-art methods.
We combine the well-known patch loss with information gathered from the parallel trained handwritten text recognition system.
This leads to a more enhanced local discriminator and results in more realistic and higher-quality generated handwritten words.
arXiv Detail & Related papers (2021-05-21T18:34:21Z) - Are Neural Language Models Good Plagiarists? A Benchmark for Neural
Paraphrase Detection [5.847824494580938]
We propose a benchmark consisting of paraphrased articles using recent language models relying on the Transformer architecture.
Our contribution fosters future research of paraphrase detection systems as it offers a large collection of aligned original and paraphrased documents.
arXiv Detail & Related papers (2021-03-23T11:01:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.