UPB at IberLEF-2023 AuTexTification: Detection of Machine-Generated Text
using Transformer Ensembles
- URL: http://arxiv.org/abs/2308.01408v1
- Date: Wed, 2 Aug 2023 20:08:59 GMT
- Title: UPB at IberLEF-2023 AuTexTification: Detection of Machine-Generated Text
using Transformer Ensembles
- Authors: Andrei-Alexandru Preda, Dumitru-Clementin Cercel, Traian Rebedea,
Costin-Gabriel Chiru
- Abstract summary: This paper describes the solutions submitted by the UPB team to the AuTexTification shared task, featured as part of IberLEF-2023.
Our best-performing model achieved macro F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset.
- Score: 0.5324802812881543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes the solutions submitted by the UPB team to the
AuTexTification shared task, featured as part of IberLEF-2023. Our team
participated in the first subtask, identifying text documents produced by large
language models instead of humans. The organizers provided a bilingual dataset
for this subtask, comprising English and Spanish texts covering multiple
domains, such as legal texts, social media posts, and how-to articles. We
experimented mostly with deep learning models based on Transformers, as well as
training techniques such as multi-task learning and virtual adversarial
training to obtain better results. We submitted three runs, two of which
consisted of ensemble models. Our best-performing model achieved macro
F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset.
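To make the ensemble idea from the abstract concrete, the sketch below shows simple soft voting over a pair of fine-tuned Transformer classifiers for the human-vs-machine label. It is a minimal illustration only: the checkpoint paths are hypothetical placeholders and the plain probability averaging is an assumption, since the abstract does not specify the team's exact combination scheme.

```python
# Minimal sketch: soft-voting ensemble of fine-tuned Transformer classifiers.
# Checkpoint paths and the averaging scheme are illustrative assumptions,
# not the exact configuration used by the UPB team.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = [
    "path/to/finetuned-model-a",  # hypothetical fine-tuned checkpoint
    "path/to/finetuned-model-b",  # hypothetical fine-tuned checkpoint
]

def ensemble_predict(text: str) -> int:
    """Average class probabilities across classifiers (0 = human, 1 = generated)."""
    probs = []
    for name in CHECKPOINTS:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
        model.eval()
        inputs = tokenizer(text, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1))
    # Stack per-model probabilities, average them, and take the winning class.
    return int(torch.stack(probs).mean(dim=0).argmax(dim=-1).item())
```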
Related papers
- Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings [22.71166607645311]
We introduce a novel suite of state-of-the-art bilingual text embedding models.
These models are capable of processing lengthy text inputs with up to 8192 tokens.
We have significantly improved the model performance on STS tasks.
We have expanded the Massive Text Embedding Benchmark to include benchmarks for German and Spanish embedding models.
arXiv Detail & Related papers (2024-02-26T20:53:12Z)
- Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains [6.44756483013808]
This paper presents the overview of the AuTexTification task as part of the IberLEF 2023 Workshop in Iberian Languages Evaluation Forum.
Our AuTexTification dataset contains more than 160,000 texts across two languages (English and Spanish) and five domains (tweets, reviews, news, legal, and how-to articles).
A total of 114 teams signed up to participate, of which 36 submitted 175 runs, and 20 of them sent their working notes.
arXiv Detail & Related papers (2023-09-20T13:10:06Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- BJTU-WeChat's Systems for the WMT22 Chat Translation Task [66.81525961469494]
This paper introduces the joint submission of the Beijing Jiaotong University and WeChat AI to the WMT'22 chat translation task for English-German.
Based on the Transformer, we apply several effective variants.
Our systems achieve 0.810 and 0.946 COMET scores.
arXiv Detail & Related papers (2022-11-28T02:35:04Z)
- DOCmT5: Document-Level Pretraining of Multilingual Language Models [9.072507490639218]
We introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large scale parallel documents.
We propose a simple and effective pre-training objective - Document Reordering Machine Translation.
DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks.
arXiv Detail & Related papers (2021-12-16T08:58:52Z)
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Pre-training via Paraphrasing [96.79972492585112]
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual paraphrasing objective.
We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization.
For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation.
arXiv Detail & Related papers (2020-06-26T14:43:43Z)