Related papers: HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?

HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?

URL: http://arxiv.org/abs/2402.11815v2
Date: Wed, 27 Mar 2024 20:30:08 GMT
Title: HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?
Authors: Shubhashis Roy Dipta, Sadat Shahriar,
Abstract summary: This paper describes our system developed for SemEval-2024 Task 8, Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection'' Our key finding is that even without an ensemble of multiple models, a single base model can have comparable performance with the help of data augmentation and contrastive learning.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper describes our system developed for SemEval-2024 Task 8, ``Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection'' Machine-generated texts have been one of the main concerns due to the use of large language models (LLM) in fake text generation, phishing, cheating in exams, or even plagiarizing copyright materials. A lot of systems have been developed to detect machine-generated text. Nonetheless, the majority of these systems rely on the text-generating model. This limitation is impractical in real-world scenarios, as it's often impossible to know which specific model the user has used for text generation. In this work, we propose a $\textbf{single}$ model based on contrastive learning, which uses $\textbf{$\approx$40% of the baseline's parameters}$ (149M vs. 355M) but shows a comparable performance on the test dataset $(\textbf{21st out of 137 participants})$. Our key finding is that even without an ensemble of multiple models, a single base model can have comparable performance with the help of data augmentation and contrastive learning. Our code is publicly available at https://github.com/dipta007/SemEval24-Task8.

Related papers

Reasoning to Attend: Try to Understand How <SEG> Token Works [44.33848900059659]
We show that the $texttSEG>$ token contributes to semantic similarity within image-text pairs. We present READ, which facilitates LMMs' resilient $textbfREA$soning capability of where to atten$textbfD$ under the guidance of highly activated points.
arXiv Detail & Related papers (2024-12-23T17:44:05Z)
Mitigating Paraphrase Attacks on Machine-Text Detectors via Paraphrase Inversion [4.148732457277201]
High-quality paraphrases are easy to produce using instruction-tuned language models. $$x2013$are known to significantly degrade the performance of machine-text detectors. We propose an approach which frames the problem as paraphrasing from paraphrased text back to the original text.
arXiv Detail & Related papers (2024-10-29T00:46:24Z)
LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection [87.43727192273772]
It is often hard to tell whether a piece of text was human-written or machine-generated. We present LLM-DetectAIve, designed for fine-grained detection. It supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished.
arXiv Detail & Related papers (2024-08-08T07:43:17Z)
Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text [7.959800630494841]
SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. We propose a RoBERTa-BiLSTM based classifier designed to classify text into two categories: AI-generated or human. Our architecture ranked 46th on the official leaderboard with an accuracy of 80.83 among 125.
arXiv Detail & Related papers (2024-07-03T10:22:23Z)
Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training [55.321010757641524]
A major public concern regarding the training of large language models (LLMs) is whether they abusing copyrighted online text. Previous membership inference methods may be misled by similar examples in vast amounts of training data. We propose an alternative textitinsert-and-detection methodology, advocating that web users and content platforms employ textbftextitunique identifiers.
arXiv Detail & Related papers (2024-03-23T06:36:32Z)
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering. We utilize the language model within the diffusion model to encode the position and texts at the line level. We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z)
Text Embeddings Reveal (Almost) As Much As Text [86.5822042193058]
We investigate the problem of embedding textitinversion, reconstructing the full text represented in dense text embeddings. We find that although a na"ive model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover $92%$ of $32text-token$ text inputs exactly.
arXiv Detail & Related papers (2023-10-10T17:39:03Z)
TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles [14.205559299967423]
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended high-quality texts, that are non-trivial to distinguish from human-written texts. Users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and dis/misinformation at scale. To mitigate this problem, a computational method to determine if a given text is a deepfake text or not is desired. We propose TopFormer to improve existing AA solutions by capturing more linguistic patterns in deepfake texts.
arXiv Detail & Related papers (2023-09-22T15:32:49Z)
Multiscale Positive-Unlabeled Detection of AI-Generated Texts [27.956604193427772]
Multiscale Positive-Unlabeled (MPU) training framework is proposed to address the difficulty of short-text detection. MPU method augments detection performance on long AI-generated texts, and significantly improves short-text detection of language model detectors.
arXiv Detail & Related papers (2023-05-29T15:25:00Z)
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection [69.29017069438228]
Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. This has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse.
arXiv Detail & Related papers (2023-05-24T08:55:11Z)
Smaller Language Models are Better Black-box Machine-Generated Text Detectors [56.36291277897995]
Small and partially-trained models are better universal text detectors. We find that whether the detector and generator were trained on the same data is not critically important to the detection success. For instance, the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has AUC of 0.45.
arXiv Detail & Related papers (2023-05-17T00:09:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.