DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation
- URL: http://arxiv.org/abs/2601.17823v1
- Date: Sun, 25 Jan 2026 13:08:43 GMT
- Title: DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation
- Authors: Pranav Kasela, Marco Braga, Alessandro Ghiotto, Andrea Pilzer, Marco Viviani, Alessandro Raganato
- Abstract summary: DIETA is a small, decoder-only Transformer model with 0.5 billion parameters. We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs. We release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles.
- Score: 74.85762984118024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present DIETA, a small, decoder-only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian-English machine translation. We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web-crawled content, subtitles, news, and literature, plus 352 million back-translated sentence pairs generated with pretrained models. Additionally, we create and release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian-English benchmarks, consistently ranking in the second quartile of a 32-system leaderboard and outperforming most other sub-3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are made publicly available, facilitating further research and development in specialized Italian-English machine translation. https://github.com/pkasela/DIETA-Machine-Translation
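Because DIETA is decoder-only, translation is presumably driven by prompting a causal language model rather than through an encoder-decoder interface. Below is a minimal sketch of that pattern using Hugging Face transformers; the checkpoint identifier and prompt template are assumptions for illustration, not details confirmed by the abstract, so consult the linked repository for the actual usage.

```python
# Minimal sketch: translation with a decoder-only LM via prompting.
# ASSUMPTIONS: the checkpoint id "pkasela/DIETA-0.5B" and the prompt
# format are hypothetical; see the DIETA repository for actual usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pkasela/DIETA-0.5B"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A decoder-only MT model sees source and target in one sequence;
# the exact separator/tag scheme is model-specific.
prompt = "Italian: Il gatto dorme sul divano.\nEnglish:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, num_beams=4)
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```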
Related papers
- CANTONMT: Investigating Back-Translation and Model-Switch Mechanisms for Cantonese-English Neural Machine Translation [9.244878233604819]
This paper investigates the development and evaluation of machine translation models from Cantonese to English.
A new parallel corpus has been created by combining, preprocessing, and cleaning different corpora available online.
A monolingual Cantonese dataset has been created through web scraping to aid the synthetic parallel corpus generation.
arXiv Detail & Related papers (2024-05-13T20:37:04Z)
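Both DIETA and the CANTONMT entry above rely on back-translation: monolingual target-side text is translated into the source language with an existing model, yielding synthetic source sentences paired with genuine target sentences. A minimal sketch follows; the OPUS-MT checkpoint is illustrative and is not the model used in either paper.

```python
# Back-translation sketch: build synthetic Italian->English pairs from
# monolingual English text. The OPUS-MT checkpoint is illustrative.
from transformers import pipeline

en_to_it = pipeline("translation", model="Helsinki-NLP/opus-mt-en-it")

monolingual_en = [
    "The committee approved the proposal.",
    "New research highlights the role of sleep in memory.",
]

synthetic_pairs = []
for target in monolingual_en:
    source = en_to_it(target, max_length=128)[0]["translation_text"]
    # Pair: (synthetic Italian source, genuine English target).
    synthetic_pairs.append((source, target))

for src, tgt in synthetic_pairs:
    print(f"{src}\t{tgt}")
```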
- Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
arXiv Detail & Related papers (2024-02-08T13:47:50Z)
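The E5 family follows a fixed embedding recipe: inputs are prefixed with "query:" or "passage:", token states are mean-pooled, and similarity is computed on normalized vectors. A short sketch, assuming the publicly released multilingual-e5-base checkpoint; verify the identifier against the technical report.

```python
# E5-style embedding recipe: "query:"/"passage:" prefixes, mean pooling
# over non-padding tokens, cosine similarity on normalized vectors.
# The checkpoint id is assumed from the public release.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "intfloat/multilingual-e5-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["query: how do you say 'cat' in Italian?",
         "passage: The Italian word for 'cat' is 'gatto'."]
batch = tokenizer(texts, padding=True, truncation=True,
                  return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq, dim)

mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, p=2, dim=1)
print((emb[0] @ emb[1]).item())  # cosine similarity
```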
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling across diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- IT5: Text-to-text Pretraining for Italian Language Understanding and Generation [16.8189104967888]
We introduce IT5, the first family of encoder-decoder transformer models pretrained specifically on Italian.
We then introduce the ItaGen benchmark, which includes a broad range of natural language understanding and generation tasks for Italian.
We find that monolingual IT5 models provide the best scale-to-performance ratio across the tested models.
arXiv Detail & Related papers (2022-03-07T22:39:01Z)
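IT5 inherits the T5 text-to-text interface, so every task is cast as mapping an input string to an output string. A minimal generation sketch follows; the checkpoint name and the "riassumi:" (summarize) task prefix are assumptions based on the public release, and a pretrained (non-fine-tuned) checkpoint needs task-specific fine-tuning before such prefixes carry meaning.

```python
# Text-to-text generation with an IT5 checkpoint. The model id and the
# task prefix are assumptions; check the IT5 release for the exact
# fine-tuned checkpoints and expected prefixes.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "gsarti/it5-base"  # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "riassumi: Il consiglio comunale ha approvato il nuovo bilancio."
inputs = tokenizer(text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```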
"Text-to-Text Transfer Transformer" (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on English-language NLP tasks.
We introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
arXiv Detail & Related papers (2020-10-22T17:58:14Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
- Lite Training Strategies for Portuguese-English and English-Portuguese Translation [67.4894325619275]
We investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks.
We propose an adaptation of the English tokenizer to represent Portuguese characters, such as diaeresis, acute and grave accents.
Our results show that our models have a competitive performance to state-of-the-art models while being trained on modest hardware.
arXiv Detail & Related papers (2020-08-20T04:31:03Z)
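The tokenizer adaptation described in the entry above, extending an English vocabulary so accented Portuguese characters are no longer unknown tokens, maps onto two standard transformers calls: add the tokens, then resize the embedding matrix. The character list below is illustrative; the paper's exact additions may differ.

```python
# Sketch: extend a pretrained English tokenizer with accented characters
# and resize the model's embeddings to match. The character set is
# illustrative; the paper's exact vocabulary changes may differ.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

accented = list("ãõáéíóúàâêôçü")            # Portuguese diacritics
num_added = tokenizer.add_tokens(accented)  # skips tokens already known
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```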
This list is automatically generated from the titles and abstracts of the papers on this site.