AraT5: Text-to-Text Transformers for Arabic Language Understanding and
Generation
- URL: http://arxiv.org/abs/2109.12068v1
- Date: Tue, 31 Aug 2021 02:02:10 GMT
- Title: AraT5: Text-to-Text Transformers for Arabic Language Understanding and
Generation
- Authors: El Moatez Billah Nagoudi and AbdelRahim Elmadany and Muhammad
Abdul-Mageed
- Abstract summary: We introduce a new benchmark for Arabic language generation (ARGEN).
We pre-train three powerful Arabic-specific text-to-text Transformer-based models and evaluate them on the two benchmarks.
Our new models perform significantly better than mT5 and exceed MARBERT, the current state-of-the-art Arabic BERT-based model, on Arabic language understanding.
- Score: 6.021269454707625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transfer learning with a unified Transformer framework (T5) that converts all
language problems into a text-to-text format has recently been proposed as a
simple, yet effective, transfer learning approach. Although a multilingual
version of the T5 model (mT5) has been introduced, it is not clear how well it
can fare on non-English tasks involving diverse data. To investigate this
question, we apply mT5 on a language with a wide variety of dialects--Arabic.
For evaluation, we use an existing benchmark for Arabic language understanding
and introduce a new benchmark for Arabic language generation (ARGEN). We also
pre-train three powerful Arabic-specific text-to-text Transformer-based models
and evaluate them on the two benchmarks. Our new models perform significantly
better than mT5 and exceed MARBERT, the current state-of-the-art Arabic
BERT-based model, on Arabic language understanding. The models also set new
SOTA on the generation benchmark. Our new models are publicly released at
https://github.com/UBC-NLP/araT5 and ARGEN will be released through the same
repository.
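To make the text-to-text setup concrete, here is a minimal usage sketch with the Hugging Face transformers library. The checkpoint id and the Arabic prompt are illustrative assumptions (the released checkpoints are listed in the GitHub repository above), and a pre-trained-only checkpoint would still need task-specific fine-tuning before its generations are meaningful.

```python
# Minimal sketch (not the authors' code): loading an AraT5-style checkpoint and
# using the unified string-in, string-out interface shared by understanding and
# generation tasks.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "UBC-NLP/AraT5-base"  # assumed Hub id; check the repository for released names
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Every task is framed as text-to-text; the prompt below is purely illustrative.
text = "عنوان: النص العربي الذي نريد توليد عنوان له"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```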
Related papers
- Arabic Automatic Story Generation with Large Language Models [15.000055598698438]
We focus on the task of generating stories from large language models (LLMs).
For our training, we use stories acquired through machine translation (MT) as well as GPT-4.
For our GPT-4 data, we introduce crafted prompts that allow us to generate data well-suited to the Arabic context.
arXiv Detail & Related papers (2024-07-10T11:26:10Z)
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with encoder-decoder architecture for offensive language identification with text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z)
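As a rough sketch of how classification becomes text-to-text, the snippet below computes a training loss by asking a seq2seq model to generate a label string instead of predicting a class id. The checkpoint (t5-small), prefix, and label text are placeholders, not the models or data used in the paper.

```python
# Sketch only: offensive-language identification cast as text generation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")   # stand-in checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

text = "classify offensive: you are wonderful"           # illustrative task prefix
label = "not offensive"                                   # the target is plain text, not a class id

inputs = tokenizer(text, return_tensors="pt")
targets = tokenizer(label, return_tensors="pt").input_ids
loss = model(**inputs, labels=targets).loss               # standard seq2seq cross-entropy
loss.backward()
```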
- ArTST: Arabic Text and Speech Transformer [2.53638770809417]
We present ArTST, a pre-trained Arabic text and speech transformer.
It supports open-source speech technologies for the Arabic language.
arXiv Detail & Related papers (2023-10-25T13:20:54Z)
- mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
arXiv Detail & Related papers (2023-05-23T16:38:01Z)
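The modular idea can be pictured with a toy PyTorch layer that keeps a shared block for language-agnostic computation and routes each input through a small language-specific module. This is only a conceptual sketch under assumptions about module placement, size, and routing, not the mmT5 architecture.

```python
# Conceptual sketch: shared (language-agnostic) block plus one small module per language.
import torch
import torch.nn as nn

class LanguageModularFFN(nn.Module):
    def __init__(self, d_model, d_bottleneck, languages):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.lang_modules = nn.ModuleDict({
            lang: nn.Sequential(
                nn.Linear(d_model, d_bottleneck), nn.ReLU(), nn.Linear(d_bottleneck, d_model))
            for lang in languages})

    def forward(self, x, lang):
        x = x + self.shared(x)                    # language-agnostic computation
        return x + self.lang_modules[lang](x)     # language-specific residual module

layer = LanguageModularFFN(d_model=512, d_bottleneck=64, languages=["ar", "en", "id"])
hidden = torch.randn(2, 16, 512)                  # (batch, sequence, d_model)
print(layer(hidden, lang="ar").shape)             # torch.Size([2, 16, 512])
```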
- idT5: Indonesian Version of Multilingual T5 Transformer [0.0]
Indonesian is spoken by almost 200 million people and is the 10th most spoken language in the world.
In this study, the mT5 model was adapted to a single language, Indonesian, resulting in a smaller pre-trained T5 model specific to Indonesian.
The fine-tuned model based on our model achieved 77.18% accuracy on sentiment analysis (SA), 8% higher than the mT5-based model, and obtained nearly the same score as the mT5-based model on question generation (QG) and question answering (QA).
arXiv Detail & Related papers (2023-02-02T03:56:16Z)
- T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
arXiv Detail & Related papers (2022-11-01T17:00:23Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The approach consists of replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
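The recipe above lends itself to a compact sketch: freeze the trained model, swap in a small embedding table for the new language's vocabulary, and optimize only that table. The tiny encoder, sizes, and loss below are placeholders for illustration, not the paper's model or training objective.

```python
# Minimal sketch of language-specific embeddings for continual learning.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for an already-trained multilingual NMT encoder."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))

model = TinyEncoder(vocab_size=32000)

# Freeze every original parameter so performance on the initial languages is untouched.
for p in model.parameters():
    p.requires_grad = False

# Swap in a small language-specific vocabulary and train only its embeddings.
new_vocab_size = 8000
model.embed = nn.Embedding(new_vocab_size, 256)      # new embeddings are trainable by default
optimizer = torch.optim.Adam(model.embed.parameters(), lr=1e-4)

batch = torch.randint(0, new_vocab_size, (4, 20))    # fake token ids for the new language
loss = model(batch).pow(2).mean()                    # placeholder loss; a real run uses the NMT loss
loss.backward()
optimizer.step()
```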
- mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs [51.67970832510462]
We improve the multilingual text-to-text transfer Transformer with translation pairs (mT6).
We explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption.
Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.
arXiv Detail & Related papers (2021-04-18T03:24:07Z)
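Of the three tasks, translation pair span corruption is the easiest to picture: a sentence and its translation are concatenated, spans are replaced with sentinel tokens on the input side, and the model reconstructs them as the target. The toy function below sketches that idea; the corruption rate, span length, and sentinel format are assumptions rather than mT6's exact recipe.

```python
# Toy T5-style span corruption applied to a concatenated translation pair.
import random

def span_corrupt(tokens, corrupt_prob=0.05, span_len=3):
    """Mask short spans with sentinels in the input; emit the masked spans as the target."""
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if random.random() < corrupt_prob:
            span = tokens[i:i + span_len]
            inputs.append(f"<extra_id_{sentinel}>")   # sentinel replaces the span in the input
            targets.append(f"<extra_id_{sentinel}>")  # target reproduces the span after it
            targets.extend(span)
            sentinel += 1
            i += span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src = "the cat sat on the mat".split()
tgt = "القطة جلست على الحصيرة".split()
pair = src + ["</s>"] + tgt                           # concatenated translation pair
model_input, model_target = span_corrupt(pair)
print(model_input)
print(model_target)
```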
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)