Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to
Document Level
- URL: http://arxiv.org/abs/2306.08122v1
- Date: Tue, 13 Jun 2023 20:34:55 GMT
- Title: Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to
Document Level
- Authors: Mujahid Ali Quidwai, Chunhui Li, Parijat Dube
- Abstract summary: Existing AI-generated text classifiers have limited accuracy and often produce false positives.
We propose a novel approach using natural language processing (NLP) techniques.
We generate multiple paraphrased versions of a given question and inputting them into the large language model to generate answers.
By using a contrastive loss function based on cosine similarity, we match generated sentences with those from the student's response.
- Score: 4.250876580245865
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The increasing reliance on large language models (LLMs) in academic writing
has led to a rise in plagiarism. Existing AI-generated text classifiers have
limited accuracy and often produce false positives. We propose a novel approach
using natural language processing (NLP) techniques, offering quantifiable
metrics at both sentence and document levels for easier interpretation by human
evaluators. Our method employs a multi-faceted approach, generating multiple
paraphrased versions of a given question and inputting them into the LLM to
generate answers. By using a contrastive loss function based on cosine
similarity, we match generated sentences with those from the student's
response. Our approach achieves up to 94% accuracy in classifying human and AI
text, providing a robust and adaptable solution for plagiarism detection in
academic settings. This method improves with LLM advancements, reducing the
need for new model training or reconfiguration, and offers a more transparent
way of evaluating and detecting AI-generated text.
Related papers
- DeTeCtive: Detecting AI-generated Text via Multi-Level Contrastive Learning [24.99797253885887]
We argue that the key to accomplishing this task lies in distinguishing writing styles of different authors.
We propose DeTeCtive, a multi-task auxiliary, multi-level contrastive learning framework.
Our method is compatible with a range of text encoders.
arXiv Detail & Related papers (2024-10-28T12:34:49Z) - Localizing Factual Inconsistencies in Attributable Text Generation [91.981439746404]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - ESPERANTO: Evaluating Synthesized Phrases to Enhance Robustness in AI Detection for Text Origination [1.8418334324753884]
This paper introduces back-translation as a novel technique for evading detection.
We present a model that combines these back-translated texts to produce a manipulated version of the original AI-generated text.
We evaluate this technique on nine AI detectors, including six open-source and three proprietary systems.
arXiv Detail & Related papers (2024-09-22T01:13:22Z) - MOSAIC: Multiple Observers Spotting AI Content, a Robust Approach to Machine-Generated Text Detection [35.67613230687864]
Large Language Models (LLMs) are trained at scale and endowed with powerful text-generating abilities.
Various proposals have been made to automatically discriminate artificially generated from human-written texts.
We derive a new, theoretically grounded approach to combine their respective strengths.
Our experiments, using a variety of generator LLMs, suggest that our method effectively leads to robust detection performances.
arXiv Detail & Related papers (2024-09-11T20:55:12Z) - Learning to Rewrite: Generalized LLM-Generated Text Detection [19.9477991969521]
Large language models (LLMs) present significant risks when used to generate non-factual content and spread disinformation at scale.
We introduce Learning2Rewrite, a novel framework for detecting AI-generated text with exceptional generalization to unseen domains.
arXiv Detail & Related papers (2024-08-08T05:53:39Z) - ToBlend: Token-Level Blending With an Ensemble of LLMs to Attack AI-Generated Text Detection [6.27025292177391]
ToBlend is a novel token-level ensemble text generation method to challenge the robustness of current AI-content detection approaches.
We find ToBlend significantly drops the performance of most mainstream AI-content detection methods.
arXiv Detail & Related papers (2024-02-17T02:25:57Z) - SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs)
We then propose textbfSequence textbfX (Check) textbfGPT, a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
arXiv Detail & Related papers (2023-10-13T07:18:53Z) - MAGE: Machine-generated Text Detection in the Wild [82.70561073277801]
Large language models (LLMs) have achieved human-level text generation, emphasizing the need for effective AI-generated text detection.
We build a comprehensive testbed by gathering texts from diverse human writings and texts generated by different LLMs.
Despite challenges, the top-performing detector can identify 86.54% out-of-domain texts generated by a new LLM, indicating the feasibility for application scenarios.
arXiv Detail & Related papers (2023-05-22T17:13:29Z) - DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection [56.513637720967566]
Large language models (LLMs) can generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets.
Existing high-quality detection methods usually require access to the interior of the model to extract the intrinsic characteristics.
We propose to extract deep intrinsic characteristics of the black-box model generated texts.
arXiv Detail & Related papers (2023-05-21T17:26:16Z) - Paraphrasing evades detectors of AI-generated text, but retrieval is an
effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z) - Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts on PPL and sentiment accuracy based on the external classifier of generated texts.
The same time, it is also easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.