Fine-tuning Large Language Models for Multigenerator, Multidomain, and
Multilingual Machine-Generated Text Detection
- URL: http://arxiv.org/abs/2401.12326v1
- Date: Mon, 22 Jan 2024 19:39:05 GMT
- Title: Fine-tuning Large Language Models for Multigenerator, Multidomain, and
Multilingual Machine-Generated Text Detection
- Authors: Feng Xiong, Thanet Markchom, Ziwei Zheng, Subin Jung, Varun Ojha,
Huizhi Liang
- Abstract summary: SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse Large Language Models (LLMs).
The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C).
- Score: 3.6433784431752434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: SemEval-2024 Task 8 introduces the challenge of identifying machine-generated
texts from diverse Large Language Models (LLMs) in various languages and
domains. The task comprises three subtasks: binary classification in
monolingual and multilingual settings (Subtask A), multi-class classification
(Subtask B), and mixed text detection (Subtask C). This paper focuses on
Subtasks A and B. Each subtask is supported by three datasets for training,
development, and testing. Two methods are employed to tackle the task: 1)
traditional machine learning (ML) with natural language processing (NLP) for
feature extraction, and 2) fine-tuning LLMs for text classification. The
results show that transformer models, particularly LoRA-RoBERTa, outperform
traditional ML methods, with majority voting proving especially effective for
identifying machine-generated texts in multilingual contexts.
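A minimal sketch of the fine-tuning approach described above, assuming the Hugging Face transformers and peft libraries: an XLM-RoBERTa classifier is wrapped with LoRA adapters and trained on binary human/machine labels, with a simple majority-voting helper for combining several fine-tuned checkpoints. The model name, hyperparameters, and the tiny in-memory dataset are illustrative stand-ins, not the authors' exact configuration.

```python
# Minimal sketch: LoRA fine-tuning of an XLM-RoBERTa classifier for
# Subtask A (0 = human, 1 = machine). Model name, hyperparameters, and the
# toy in-memory dataset are illustrative, not the authors' exact setup.
from collections import Counter

from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Only the low-rank adapter weights (plus the classification head) are trained.
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                         lora_dropout=0.1)
model = get_peft_model(model, lora_config)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Toy rows standing in for the task's training and development files.
train_dataset = Dataset.from_dict({
    "text": ["A sentence written by a person.",
             "A sentence produced by a language model."],
    "label": [0, 1],
}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-roberta-detector",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()

# Majority voting over several models' hard predictions, one list per model,
# as a simple way to combine checkpoints in the multilingual track.
def majority_vote(predictions_per_model):
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]
```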
Related papers
- SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection [68.858931667807]
Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine.
Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM.
Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine.
arXiv Detail & Related papers (2024-04-22T13:56:07Z)
- Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings [22.71166607645311]
We introduce a novel suite of state-of-the-art bilingual text embedding models.
These models are capable of processing lengthy text inputs with up to 8192 tokens.
We have significantly improved the model performance on STS tasks.
We have expanded the Massive Text Embedding Benchmark to include benchmarks for German and Spanish embedding models.
arXiv Detail & Related papers (2024-02-26T20:53:12Z)
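The entry above names multi-task contrastive learning as the training signal for these bilingual embedding models. As a rough illustration of the general technique (not the paper's exact multi-task objective), the sketch below computes a symmetric in-batch InfoNCE loss over paired translations; the temperature and embedding size are arbitrary.

```python
# Generic in-batch contrastive (InfoNCE-style) loss over paired bilingual
# sentence embeddings. A sketch of the general technique only; the
# temperature and the random toy embeddings are illustrative.
import torch
import torch.nn.functional as F

def info_nce_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """emb_a[i] and emb_b[i] are embeddings of the i-th translation pair."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature   # cosine similarity matrix
    targets = torch.arange(emb_a.size(0))    # matching pair is the diagonal
    # Symmetric loss: align both retrieval directions (a->b and b->a).
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random vectors standing in for encoder outputs.
batch = torch.randn(8, 768)
loss = info_nce_loss(batch, batch + 0.01 * torch.randn_like(batch))
print(loss.item())
```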
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
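The "translate-and-test" pipeline revisited in the T3L entry above can be illustrated with a plain two-stage baseline: translate the input into the classifier's language, then classify the translation. The sketch below uses off-the-shelf Hugging Face pipelines with illustrative model names; T3L itself couples the translation and classification stages more tightly, so this shows only the classic baseline.

```python
# Sketch of the classic "translate-and-test" baseline: translate the input
# into the classifier's language, then run a monolingual classifier.
# Model names are illustrative assumptions, not the paper's components.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

def translate_and_test(text_in_source_language: str) -> dict:
    english = translator(text_in_source_language)[0]["translation_text"]
    return classifier(english)[0]

print(translate_and_test("Der Film war überraschend gut."))
```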
- SheffieldVeraAI at SemEval-2023 Task 3: Mono and multilingual approaches for news genre, topic and persuasion technique classification [3.503844033591702]
This paper describes our approach for SemEval-2023 Task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multi-lingual setup.
arXiv Detail & Related papers (2023-03-16T15:54:23Z)
- LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent advances still require training a separate model for each language pair, which is costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
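The FILTER entry above mentions a KL-divergence self-teaching loss over auto-generated soft pseudo-labels for translated text. A generic sketch of such a term is shown below, combined with an ordinary cross-entropy term purely for illustration; the weighting and tensor shapes are assumptions, not FILTER's exact formulation.

```python
# Generic sketch of a KL-divergence self-teaching term: the model's
# predictions on translated target-language text are pulled toward soft
# pseudo-labels produced automatically (e.g., by a teacher pass on the
# source-language text). The weighting is illustrative only.
import torch
import torch.nn.functional as F

def self_teaching_loss(student_logits: torch.Tensor,
                       soft_pseudo_labels: torch.Tensor,
                       hard_labels: torch.Tensor,
                       kl_weight: float = 0.5) -> torch.Tensor:
    ce = F.cross_entropy(student_logits, hard_labels)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  soft_pseudo_labels, reduction="batchmean")
    return ce + kl_weight * kl

# Toy usage: 4 examples, 3 classes.
logits = torch.randn(4, 3)
pseudo = F.softmax(torch.randn(4, 3), dim=-1)
labels = torch.tensor([0, 2, 1, 0])
print(self_teaching_loss(logits, pseudo, labels).item())
```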
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
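The CoSDA-ML entry above describes generating code-switched training data by substituting words with translations from bilingual dictionaries. The sketch below illustrates that idea with a toy dictionary and a fixed replacement probability; the dictionary entries, languages, and probability are illustrative stand-ins for real bilingual resources covering many target languages.

```python
# Sketch of dictionary-based code-switching augmentation: each token is
# replaced, with some probability, by a translation drawn from a bilingual
# dictionary in a randomly chosen target language. The tiny dictionary and
# replacement probability below are illustrative only.
import random

TOY_DICTIONARIES = {
    "de": {"good": "gut", "movie": "Film", "very": "sehr"},
    "es": {"good": "bueno", "movie": "película", "very": "muy"},
}

def code_switch(sentence: str, replace_prob: float = 0.5,
                languages=("de", "es")) -> str:
    tokens = []
    for token in sentence.split():
        lang = random.choice(languages)
        translation = TOY_DICTIONARIES[lang].get(token.lower())
        if translation is not None and random.random() < replace_prob:
            tokens.append(translation)
        else:
            tokens.append(token)
    return " ".join(tokens)

random.seed(0)
print(code_switch("the movie was very good"))
```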