IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
- URL: http://arxiv.org/abs/2203.03759v2
- Date: Mon, 20 May 2024 13:19:08 GMT
- Title: IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
- Authors: Gabriele Sarti, Malvina Nissim
- Abstract summary: We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian.
We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian.
We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models.
- Score: 16.8189104967888
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian. We document and perform a thorough cleaning procedure for a large Italian corpus and use it to pretrain four IT5 model sizes. We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian, and use it to evaluate the performance of IT5 models and multilingual baselines. We find monolingual IT5 models to provide the best scale-to-performance ratio across tested models, consistently outperforming their multilingual counterparts and setting a new state-of-the-art for Italian language generation.
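Usage sketch (illustrative only): the snippet below shows how an encoder-decoder checkpoint of this kind is typically loaded and run as text-to-text generation with Hugging Face Transformers. The checkpoint name and the task prefix are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: loading an IT5-style encoder-decoder checkpoint and running
# text-to-text generation. "gsarti/it5-base" and the "riassumi:" prefix are
# assumed for illustration; a fine-tuned variant would expect its own inputs.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "gsarti/it5-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encoder-decoder models cast every task as text-to-text: a string goes in,
# a generated string comes out.
inputs = tokenizer(
    "riassumi: Il modello IT5 è stato preaddestrato su un ampio corpus italiano.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```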
Related papers
- DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation [74.85762984118024]
DIETA is a small, decoder-only Transformer model with 0.5 billion parameters.
We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs.
We release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles.
arXiv Detail & Related papers (2026-01-25T13:08:43Z) - Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
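A minimal sketch of how E5-style multilingual embeddings are typically computed, here via the sentence-transformers library. The checkpoint name and the "query:"/"passage:" input prefixes are assumptions based on the E5 model family, not details stated in this summary.

```python
# Hedged sketch: multilingual sentence embeddings with an E5-style model.
# "intfloat/multilingual-e5-base" is an assumed checkpoint name.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed checkpoint
queries = ["query: che cos'è IT5?"]
passages = ["passage: IT5 è una famiglia di modelli encoder-decoder per l'italiano."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is just a dot product.
print((q_emb @ p_emb.T)[0, 0])
```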
arXiv Detail & Related papers (2024-02-08T13:47:50Z) - Exploring Large Language Models for Classical Philology [17.856304057963776]
We create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages.
We evaluate all models on morphological and syntactic tasks, including lemmatization.
Results show that our models provide significant improvements over the SoTA.
arXiv Detail & Related papers (2023-05-23T05:21:02Z) - Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find that finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
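A brief illustration of the zero-shot prompting setting described above, using a multitask-finetuned mT0-style checkpoint. The checkpoint name and prompt wording are assumptions for illustration; the paper's own prompts may differ.

```python
# Minimal sketch: zero-shot prompting a multitask-finetuned multilingual model.
# "bigscience/mt0-small" is an assumed checkpoint name.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/mt0-small"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# An English instruction applied to non-English input: the claim above is that
# English-only multitask finetuning still generalizes to other languages.
prompt = "Translate to English: Il gatto dorme sul divano."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```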
arXiv Detail & Related papers (2022-11-03T13:19:32Z) - Sequence to sequence pretraining for a less-resourced Slovenian language [0.0]
We train two differently sized T5-type sequence-to-sequence models for the morphologically rich Slovene language with far fewer resources and analyze their behavior.
On classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model, but they are worth considering for generative tasks.
arXiv Detail & Related papers (2022-07-28T10:08:50Z) - Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: the KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
arXiv Detail & Related papers (2022-05-18T09:17:14Z) - IndT5: A Text-to-Text Transformer for 10 Indigenous Languages [7.952582509792971]
We introduce IndT5, the first Transformer language model for Indigenous languages.
We build IndCorpus, a new dataset covering ten Indigenous languages and Spanish.
We present the application of IndT5 to machine translation by investigating different approaches to translate between Spanish and the Indigenous languages.
arXiv Detail & Related papers (2021-04-04T07:09:09Z) - mT5: A massively multilingual pre-trained text-to-text transformer [60.0210636815514]
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
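To make the pretraining recipe behind T5-style models (including mT5 and IT5) concrete, here is an illustrative, self-contained sketch of the span-corruption objective: random input spans are replaced by sentinel tokens, and the target reconstructs the dropped spans after the matching sentinels. The sentinel naming follows the T5 convention; this is not code from either paper.

```python
# Illustrative sketch of T5-style span corruption (not the papers' exact code).
def span_corrupt(tokens, spans):
    """tokens: list of strings; spans: list of (start, end) index pairs to mask."""
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[cursor:start] + [sentinel]   # keep text, drop the span
        target += [sentinel] + tokens[start:end]          # target restores the span
        cursor = end
    corrupted += tokens[cursor:]
    target.append(f"<extra_id_{len(spans)}>")             # final sentinel closes the target
    return " ".join(corrupted), " ".join(target)

tokens = "il modello viene preaddestrato su un grande corpus italiano".split()
inp, tgt = span_corrupt(tokens, [(1, 2), (5, 7)])
print(inp)  # il <extra_id_0> viene preaddestrato su <extra_id_1> corpus italiano
print(tgt)  # <extra_id_0> modello <extra_id_1> un grande <extra_id_2>
```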
arXiv Detail & Related papers (2020-10-22T17:58:14Z) - Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
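A hedged sketch of the direct non-English translation setting described above, using an M2M-100-style many-to-many checkpoint. The checkpoint name is assumed for illustration and is a smaller variant than the paper's full model.

```python
# Minimal sketch: direct Italian -> German translation without pivoting
# through English. "facebook/m2m100_418M" is an assumed checkpoint name.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"  # assumed checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "it"  # source language
inputs = tokenizer("Il corpus copre migliaia di direzioni linguistiche.", return_tensors="pt")
# The forced BOS token selects the target language for generation.
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```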
arXiv Detail & Related papers (2020-10-21T17:01:23Z) - Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.