idT5: Indonesian Version of Multilingual T5 Transformer
- URL: http://arxiv.org/abs/2302.00856v2
- Date: Thu, 9 Nov 2023 08:47:19 GMT
- Title: idT5: Indonesian Version of Multilingual T5 Transformer
- Authors: Mukhlish Fuadi, Adhi Dharma Wibawa, Surya Sumpeno
- Abstract summary: Indonesian is spoken by almost 200 million people and is the 10th most spoken language in the world.
In this study, the mT5 model was adapted for only one language, Indonesian, resulting in a pre-trained T5 model that was specific only for Indonesian with a smaller size.
A model fine-tuned from our pre-trained model achieved 77.18% accuracy on SA, 8% higher than the mT5-based model, and obtained nearly the same scores as the mT5-based model on QG and QA.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Indonesian language is spoken by almost 200 million people and is the 10th
most spoken language in the world, but it is under-represented in NLP (Natural
Language Processing) research. A sparsity of language resources has hampered
previous work on Indonesian. The Transformer is a new architecture rapidly
becoming dominant for NLP, surpassing alternatives like convolutional and
recurrent neural networks. T5 (Text-to-Text Transfer Transformer) is a
Transformer model that converts all text-based language problems to
text-to-text format for English. The multilingual variant is mT5 (multilingual
T5) which has shown promising results on many NLP tasks across languages.
However, the size of this multilingual model is a drawback for its application
in real production applications, which sometimes require only one language. In
this study, the mT5 model was adapted for only one language, Indonesian,
resulting in a pre-trained T5 model that was specific only for Indonesian with
a smaller size. For performance comparison, we fine-tuned this model and the
mT5 model to the Sentiment Analysis (SA), Question Generation (QG), and
Question Answering (QA) tasks with the same mechanism and datasets. The model
fine-tuned from our pre-trained model achieved 77.18% accuracy on SA, 8% higher
than the mT5-based model, and obtained nearly the same scores as the mT5-based
model on QG and QA. The results confirm that it is possible to produce a smaller
pre-trained model that maintains comparable performance while reducing the model
size by up to 58%. In addition, the resulting model requires less memory, loads
faster, and has faster inference times.
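The abstract does not spell out how the Indonesian-only model was carved out of mT5, but a common route to this kind of size reduction is to restrict the shared SentencePiece vocabulary to the pieces actually used by the target language and to shrink the embedding and LM-head matrices to match. The sketch below uses Hugging Face transformers; the checkpoint name, corpus path, and id-keeping heuristic are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' released pipeline): keep only the vocabulary
# entries seen in an Indonesian corpus and shrink the vocabulary-dependent
# matrices of mT5 accordingly.
import torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# Token ids to keep: special/frequent pieces plus everything seen in Indonesian text.
kept_ids = set(range(1000))  # pad, eos, unk and common subword pieces (heuristic)
with open("indonesian_corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for line in f:
        kept_ids.update(tokenizer(line.strip())["input_ids"])
kept_ids = sorted(kept_ids)

d_model = model.config.d_model

# Shrink the shared input embedding (used by both encoder and decoder).
old_embed = model.get_input_embeddings().weight.data
new_embed = torch.nn.Embedding(len(kept_ids), d_model)
new_embed.weight.data = old_embed[kept_ids].clone()
model.set_input_embeddings(new_embed)

# mT5 keeps an untied LM head, so shrink it separately.
old_head = model.lm_head.weight.data
new_head = torch.nn.Linear(d_model, len(kept_ids), bias=False)
new_head.weight.data = old_head[kept_ids].clone()
model.lm_head = new_head

model.config.vocab_size = len(kept_ids)
model.save_pretrained("idt5-trimmed")
# A complete pipeline would also rebuild the SentencePiece vocabulary so that the
# tokenizer's ids line up with the remapped embedding rows.
```

Only the vocabulary-dependent matrices are touched in such a reduction; the rest of the encoder-decoder stack is untouched, which is consistent with the trimmed model retaining performance comparable to mT5 after fine-tuning on the downstream tasks.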
Related papers
- Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
arXiv Detail & Related papers (2024-02-08T13:47:50Z)
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification with text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z)
- mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
arXiv Detail & Related papers (2023-05-23T16:38:01Z)
- QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models under a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
- T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5 [65.32642587901903]
We conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks.
We extend the idea to create T5lephone, a variant of T5 that is pretrained using phonemicized text.
arXiv Detail & Related papers (2022-11-01T17:00:23Z)
- Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning [99.42850643947439]
We show that going beyond English-centric bitexts, coupled with a novel sampling strategy, substantially boosts performance across model sizes.
Our XY-LENT XL variant outperforms XLM-R XXL and exhibits competitive performance with mT5 XXL while being 5x and 6x smaller, respectively.
arXiv Detail & Related papers (2022-10-26T17:16:52Z)
- Sequence to sequence pretraining for a less-resourced Slovenian language [0.0]
We train two differently sized T5-type sequence-to-sequence models for the morphologically rich Slovene language with far fewer resources and analyze their behavior.
On classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model, but they are worth considering for generative tasks.
arXiv Detail & Related papers (2022-07-28T10:08:50Z)
- mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
- Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya [0.0]
We propose a cost-effective transfer learning method to adopt a strong source-language model.
With only 10k examples of the given Tigrinya sentiment analysis dataset, English XLNet achieved a 78.88% F1-score.
Fine-tuning the (English) XLNet model on the CLS dataset gives promising results compared to mBERT, and it even outperformed mBERT on one Japanese dataset.
arXiv Detail & Related papers (2020-06-13T18:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.