SynSciPass: detecting appropriate uses of scientific text generation
- URL: http://arxiv.org/abs/2209.03742v1
- Date: Wed, 7 Sep 2022 13:16:40 GMT
- Title: SynSciPass: detecting appropriate uses of scientific text generation
- Authors: Domenic Rosati
- Abstract summary: We develop a framework for dataset development that provides a nuanced approach to detecting machine generated text.
By training the same model that performed well on DAGPap22 on SynSciPass, we show that not only is the model more robust to domain shifts but also is able to uncover the type of technology used for machine generated text.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Approaches to machine generated text detection tend to focus on binary
classification of human versus machine written text. In the scientific domain
where publishers might use these models to examine manuscripts under
submission, misclassification has the potential to cause harm to authors.
Additionally, authors may appropriately use text generation models, for example
through assistive technologies like translation tools. In this setting, a
binary classification scheme might flag appropriate uses of assistive text
generation technology as simply machine generated, which is a cause for
concern. In our work, we simulate this scenario by presenting a
state-of-the-art detector trained on DAGPap22 with machine translated
passages from Scielo and find that the model performs no better than chance.
Given this finding, we develop a framework for dataset development that
provides a nuanced approach to detecting machine generated text by labeling the
type of technology used, such as translation or paraphrase, resulting in the
construction of SynSciPass. By training the same model that performed well on
DAGPap22 on SynSciPass, we show that not only is the model more robust to
domain shifts but also is able to uncover the type of technology used for
machine generated text. Despite this, we conclude that current datasets are
neither comprehensive nor realistic enough to understand how these models would
perform in the wild where manuscript submissions can come from many unknown or
novel distributions, how they would perform on scientific full-texts rather
than small passages, and what might happen when there is a mix of appropriate
and inappropriate uses of natural language generation.
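
The contrast between binary labels and technology-type labels described in the abstract can be illustrated with a minimal sketch. The label names and toy passages below are assumptions made for illustration; the abstract only names translation and paraphrase as example types, and this is not the SynSciPass schema or the author's code.

```python
from collections import Counter
from enum import Enum


class Source(Enum):
    """Hypothetical label set; only translation and paraphrase are named in the abstract."""
    HUMAN = "human"
    TRANSLATION = "translation"
    PARAPHRASE = "paraphrase"
    GENERATION = "generation"  # assumed extra class for free-form generated text


def to_binary(label: Source) -> str:
    """Collapse a technology-type label into the human-versus-machine scheme."""
    return "human" if label is Source.HUMAN else "machine"


# Toy passages (illustrative placeholders, not real data).
passages = [
    ("abstract written by the author", Source.HUMAN),
    ("abstract translated from Portuguese", Source.TRANSLATION),
    ("abstract reworded by a paraphrasing tool", Source.PARAPHRASE),
    ("abstract produced from a short prompt", Source.GENERATION),
]

# Under the binary scheme, translated text is indistinguishable from generated text.
binary_counts = Counter(to_binary(label) for _, label in passages)

# Under the typed scheme, each kind of assistance keeps its own label.
typed_counts = Counter(label.value for _, label in passages)

print(binary_counts)
print(typed_counts)
```

With typed labels, a publisher could in principle treat a translation label differently from a generation label, which the binary scheme cannot express.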
Related papers
- RKadiyala at SemEval-2024 Task 8: Black-Box Word-Level Text Boundary Detection in Partially Machine Generated Texts [0.0]
This paper introduces a few reliable approaches for identifying, at the word level, which parts of a given text are machine generated.
We present a comparison with proprietary systems and report our model's performance on texts from unseen domains and generators.
The findings reveal significant improvements in detection accuracy, along with comparisons on other aspects of detection capability.
arXiv Detail & Related papers (2024-10-22T03:21:59Z)
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Few-Shot Detection of Machine-Generated Text using Style Representations [4.326503887981912]
Language models that convincingly mimic human writing pose a significant risk of abuse.
We propose to leverage representations of writing style estimated from human-authored text.
We find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors.
arXiv Detail & Related papers (2024-01-12T17:26:51Z)
- Detection of Machine-Generated Text: Literature Survey [0.0]
This literature survey aims to compile and synthesize accomplishments and developments in the field of machine-generated text.
It also gives an overview of machine-generated text trends and explores the larger societal implications.
arXiv Detail & Related papers (2024-01-02T01:44:15Z)
- Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors.
We find that whether the detector and generator were trained on the same data is not critically important to the detection success.
For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has AUC of 0.45.
arXiv Detail & Related papers (2023-05-17T00:09:08Z)
- Paraphrase Identification with Deep Learning: A Review of Datasets and Methods [1.4325734372991794]
We investigate how the under-representation of certain paraphrase types in popular datasets affects the ability to detect plagiarism.
We introduce and validate a new refined typology for paraphrases.
We propose new directions for future research and dataset development to enhance AI-based paraphrase detection.
arXiv Detail & Related papers (2022-12-13T23:06:20Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications [0.02578242050187029]
This paper presents two datasets comprised of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
arXiv Detail & Related papers (2020-10-15T02:35:31Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-15T02:35:31Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
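
Several entries in the list above report detector quality as AUC (for example 0.81 versus 0.45). As a rough aid to reading those numbers, the sketch below computes AUC as the probability that a randomly chosen machine-generated text receives a higher detector score than a randomly chosen human-written one, counting ties as half. The function name and the toy scores are assumptions for illustration and are not taken from any of the listed papers.

```python
def pairwise_auc(machine_scores, human_scores):
    """AUC as the fraction of (machine, human) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))


# Toy detector scores (illustrative values only); higher means "more likely machine".
machine_scores = [0.9, 0.8, 0.4]
human_scores = [0.7, 0.3, 0.2]

# 8 of the 9 pairs are ranked correctly (0.4 versus 0.7 is the only miss).
print(pairwise_auc(machine_scores, human_scores))
```

Under this reading, an AUC of 0.5 corresponds to random ranking, and a value below 0.5 means the detector ranks human text above machine text more often than not.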
This list is automatically generated from the titles and abstracts of the papers in this site.