CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data
Limitation With Contrastive Learning
- URL: http://arxiv.org/abs/2212.10341v2
- Date: Fri, 20 Oct 2023 13:21:51 GMT
- Title: CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Data
Limitation With Contrastive Learning
- Authors: Xiaoming Liu, Zhaohan Zhang, Yichen Wang, Hang Pu, Yu Lan, Chao Shen
- Abstract summary: We present a coherence-based contrastive learning model named CoCo to detect possible MGT under low-resource scenarios.
To exploit linguistic features, we encode coherence information, in the form of a graph, into the text representation.
Experimental results on two public datasets and two self-constructed datasets show that our approach significantly outperforms state-of-the-art methods.
- Score: 14.637303913878435
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Machine-Generated Text (MGT) detection, the task of discriminating MGT from
Human-Written Text (HWT), plays a crucial role in preventing misuse of text
generative models, which have recently excelled at mimicking human writing styles.
Recently proposed detectors usually take coarse text sequences as input and
fine-tune pretrained models with a standard cross-entropy loss. However, these
methods fail to consider the linguistic structure of texts. Moreover, they lack
the ability to handle the low-resource problem, which often arises in
practice given the enormous amount of textual data online. In this paper,
we present a coherence-based contrastive learning model named CoCo to detect
possible MGT under low-resource scenarios. To exploit linguistic
features, we encode coherence information, in the form of a graph, into the text
representation. To tackle the challenge of limited data, we employ a
contrastive learning framework and propose an improved contrastive loss that
prevents the performance degradation caused by easy samples. Experimental
results on two public datasets and two self-constructed datasets show that our
approach significantly outperforms state-of-the-art methods. We also find,
surprisingly, that MGTs produced by up-to-date language models can be easier
to detect than those from earlier models, and we propose some preliminary
explanations for this counter-intuitive phenomenon. All code and datasets
are open-sourced.
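The abstract does not spell out the improved contrastive loss, so the sketch below is only one plausible reading: a supervised contrastive objective whose anchors are down-weighted when their positives are already well separated, so that easy samples contribute less to the gradient. The function name, the sigmoid-based weighting, and the pooled-feature input are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def weighted_supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss with a hypothetical easy-sample down-weighting.

    features: (batch, dim) text representations; in CoCo's setting these would
              come from an encoder that has fused the coherence graph into the
              text representation, but here they are just pooled vectors.
    labels:   (batch,) with 0 = human-written, 1 = machine-generated.
    """
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature                  # pairwise similarities
    eye = torch.eye(features.size(0), dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye

    # Row-wise log-softmax, excluding self-similarity.
    sim = sim.masked_fill(eye, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Down-weight "easy" anchors whose positives are already close
    # (detached so the weights do not receive gradients).
    with torch.no_grad():
        mean_pos_sim = (sim.masked_fill(~pos_mask, 0.0).sum(1)
                        / pos_mask.sum(1).clamp(min=1))
        weights = 1.0 - torch.sigmoid(mean_pos_sim)            # assumed weighting

    per_anchor = (-log_prob.masked_fill(~pos_mask, 0.0).sum(1)
                  / pos_mask.sum(1).clamp(min=1))
    return (weights * per_anchor).mean()
```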
Related papers
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
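For readers unfamiliar with the second loss mentioned above, a standard batch-all triplet formulation (a generic definition, not this paper's code) averages a margin hinge over every valid (anchor, positive, negative) triplet in a batch:

```python
import torch

def batch_all_triplet_loss(embeddings, labels, margin=0.5):
    """Batch-all triplet loss: hinge over every valid (a, p, n) triplet."""
    dist = torch.cdist(embeddings, embeddings)        # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # valid[a, p, n] is True iff labels[a] == labels[p], a != p,
    # and labels[a] != labels[n].
    valid = (same & ~eye).unsqueeze(2) & (~same).unsqueeze(1)

    # loss[a, p, n] = max(0, d(a, p) - d(a, n) + margin)
    loss = (dist.unsqueeze(2) - dist.unsqueeze(1) + margin).clamp(min=0)
    loss = loss * valid
    num_active = (loss > 0).sum().clamp(min=1)        # average over non-zero terms
    return loss.sum() / num_active
```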
arXiv Detail & Related papers (2024-03-18T23:41:52Z) - Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the scarcity of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Critic-Driven Decoding for Mitigating Hallucinations in Data-to-text
Generation [5.304395026626743]
Hallucination of text ungrounded in the input is a well-known problem in neural data-to-text generation.
We propose a new way to mitigate hallucinations by combining the probabilistic output of a generator language model with the output of a special "text critic".
Our method does not need any changes to the underlying LM's architecture or training procedure.
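As a rough illustration of that combination (our reading of the summary, not the authors' exact method): at each decoding step, the generator's next-token distribution can be rescored by a binary critic that judges whether each candidate continuation stays grounded in the input. The interpolation weight `alpha` and the greedy selection below are illustrative assumptions.

```python
import torch

def critic_rescored_step(lm_logits, critic_log_probs, alpha=1.0):
    """One decoding step: combine generator and critic scores.

    lm_logits: (vocab,) next-token logits from the generator LM.
    critic_log_probs: (vocab,) log P(grounded | input, prefix + token) from a
        binary critic, evaluated per candidate token (in practice, only for
        the top-k candidates to keep the cost manageable).
    alpha: critic weight; alpha = 0 recovers plain LM decoding, consistent
        with the claim that the underlying LM needs no changes.
    """
    lm_log_probs = torch.log_softmax(lm_logits, dim=-1)
    combined = lm_log_probs + alpha * critic_log_probs
    return torch.argmax(combined)  # greedy choice; sampling also works
```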
arXiv Detail & Related papers (2023-10-25T20:05:07Z) - WeCheck: Strong Factual Consistency Checker via Weakly Supervised
Learning [40.5830891229718]
We propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck.
Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves an average 3.4% absolute improvement over previous state-of-the-art methods on the TRUE benchmark.
arXiv Detail & Related papers (2022-12-20T08:04:36Z) - On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z) - Few-shot Text Classification with Dual Contrastive Consistency [31.141350717029358]
In this paper, we explore how to utilize pre-trained language models for few-shot text classification.
We adopt supervised contrastive learning on few labeled data and consistency-regularization on vast unlabeled data.
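The contrastive half of this recipe resembles the CoCo sketch above; for the consistency-regularization half, a common pattern (assumed here, since the summary gives no details) penalizes divergence between the model's predictions on two augmented views of the same unlabeled text:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_view1, logits_view2):
    """Symmetric KL between predictions on two views of unlabeled text,
    e.g., the original and a dropout- or back-translation-augmented copy
    (the choice of augmentation is an assumption, not this paper's)."""
    p = F.log_softmax(logits_view1, dim=-1)
    q = F.log_softmax(logits_view2, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction='batchmean')
                  + F.kl_div(q, p, log_target=True, reduction='batchmean'))
```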
arXiv Detail & Related papers (2022-09-29T19:26:23Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Search and Learn: Improving Semantic Coverage for Data-to-Text
Generation [30.07712039293558]
In this work, we focus on few-shot data-to-text generation.
We propose a search-and-learning approach that leverages pretrained language models but inserts the missing slots to improve semantic coverage.
Experiments show that our model achieves high performance on E2E and WikiBio datasets.
arXiv Detail & Related papers (2021-12-06T03:51:56Z) - Toward the Understanding of Deep Text Matching Models for Information
Retrieval [72.72380690535766]
This paper aims at testing whether existing deep text matching methods satisfy fundamental heuristics of information retrieval.
Specifically, four constraints are examined in our study, i.e., the term frequency constraint, the term discrimination constraint, the length normalization constraints, and the TF-length constraint.
Experimental results on LETOR 4.0 and MS MARCO show that all the investigated deep text matching methods satisfy the above constraints with high probability.
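To make the constraint checking concrete (our illustration, not the paper's protocol), the term frequency constraint, for example, can be probed empirically: adding one more occurrence of a query term to a document should not lower the relevance score. The `score` callable below stands in for a hypothetical matching model.

```python
from typing import Callable

def tf_constraint_holds(score: Callable[[str, str], float],
                        query: str, doc: str, term: str) -> bool:
    """Empirical check of the term frequency (TF) constraint: appending an
    extra occurrence of a query term should not lower the relevance score."""
    assert term in query.split(), "term must occur in the query"
    return score(query, doc + " " + term) >= score(query, doc)
```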
arXiv Detail & Related papers (2021-08-16T13:33:15Z) - Text Generation by Learning from Demonstrations [17.549815256968877]
Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation.
We propose GOLD: an easy-to-optimize algorithm that learns from expert demonstrations by importance weighting.
According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient.
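A compact sketch of the importance-weighting idea as we read it from this summary: demonstration tokens are reweighted by the policy's own (detached) probability, so training concentrates on outputs the model already assigns reasonable mass to. The per-token weighting and the clipping floor `w_min` are assumptions, not GOLD's exact estimator.

```python
import torch

def gold_style_loss(log_probs, target_ids, w_min=0.1):
    """Importance-weighted MLE on expert demonstrations (GOLD-style sketch).

    log_probs:  (seq, vocab) per-step log-probabilities from the policy.
    target_ids: (seq,) demonstration token ids.
    """
    token_log_p = log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    # Importance weight ~ policy probability of the demonstration token,
    # detached from the graph and clipped from below for stability.
    weights = token_log_p.detach().exp().clamp(min=w_min)
    return -(weights * token_log_p).sum()
```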
arXiv Detail & Related papers (2020-09-16T17:58:37Z)