naab: A ready-to-use plug-and-play corpus for Farsi
- URL: http://arxiv.org/abs/2208.13486v2
- Date: Mon, 23 Dec 2024 07:29:49 GMT
- Title: naab: A ready-to-use plug-and-play corpus for Farsi
- Authors: Sadra Sabouri, Elnaz Rahmati, Soroush Gooran, Hossein Sameti,
- Abstract summary: naab is the largest publicly available, cleaned, and ready-to-use Farsi textual corpus.
Naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words.
Naab-raw is an unprocessed version of the dataset, along with a pre-processing toolkit.
- Score: 1.381198851698147
- License:
- Abstract: The rise of large language models (LLMs) has transformed numerous natural language processing (NLP) tasks, yet their performance in low and mid-resource languages, such as Farsi, still lags behind resource-rich languages like English. To address this gap, we introduce naab, the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words. Named after the Farsi word NAAB (meaning "pure" or "high-grade"), this corpus is openly accessible via Hugging Face, offering researchers a valuable resource for Farsi NLP tasks. In addition to naab, we provide naab-raw, an unprocessed version of the dataset, along with a pre-processing toolkit that allows users to clean their custom corpora. These resources empower NLP researchers and practitioners, particularly those focusing on low-resource languages, to improve the performance of LLMs in their respective domains and bridge the gap between resource-rich and resource-poor languages.
Related papers
- Efficient Continual Pre-training of LLMs for Low-resource Languages [45.44796295841526]
We develop a new algorithm to select a subset of texts from a larger corpus.
In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary.
arXiv Detail & Related papers (2024-12-13T16:13:35Z) - Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z) - High-quality Data-to-Text Generation for Severely Under-Resourced
Languages with Out-of-the-box Large Language Models [5.632410663467911]
We explore the extent to which pretrained large language models (LLMs) can bridge the performance gap for under-resourced languages.
We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins.
For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English.
arXiv Detail & Related papers (2024-02-19T16:29:40Z) - UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised
Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z) - Multilingual Word Embeddings for Low-Resource Languages using Anchors
and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - ParsiNLU: A Suite of Language Understanding Challenges for Persian [23.26176232463948]
This work focuses on Persian language, one of the widely spoken languages in the world.
There are few NLU datasets available for this rich language.
ParsiNLU is the first benchmark in Persian language that includes a range of high-level tasks.
arXiv Detail & Related papers (2020-12-11T06:31:42Z) - Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z) - Combining Pretrained High-Resource Embeddings and Subword
Representations for Low-Resource Languages [24.775371434410328]
We explore techniques exploiting the qualities of morphologically rich languages (MRLs)
We show that a meta-embedding approach combining both pretrained and morphologically-informed word embeddings performs best in the downstream task of Xhosa-English translation.
arXiv Detail & Related papers (2020-03-09T21:30:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.