NusaCrowd: Open Source Initiative for Indonesian NLP Resources
- URL: http://arxiv.org/abs/2212.09648v4
- Date: Fri, 21 Jul 2023 14:44:45 GMT
- Title: NusaCrowd: Open Source Initiative for Indonesian NLP Resources
- Authors: Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata,
Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa
Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan,
Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio
Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar
Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki
Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Farid
Adilazuarda, Ryan Ignatius, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari,
Wenliang Dai, Yan Xu, Dyah Damapuspita, Cuk Tho, Ichwanul Muslim Karo Karo,
Tirana Noor Fatyanosa, Ziwei Ji, Pascale Fung, Graham Neubig, Timothy
Baldwin, Sebastian Ruder, Herry Sujaini, Sakriani Sakti, Ayu Purwarianti
- Abstract summary: NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
- Score: 104.5381571820792
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken.
Related papers
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian Languages [20.051904366350293]
NusaCrowd strives to provide the largest crowdsourcing aggregation with standardized data loading for NLP tasks in all Indonesian languages.
By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia.
arXiv Detail & Related papers (2022-07-21T15:05:42Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)
- IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP [41.57622648924415]
The Indonesian language is spoken by almost 200 million people and is the 10th most spoken language in the world.
Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization.
We release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM.
arXiv Detail & Related papers (2020-11-02T01:54:56Z)
- IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding [41.691861010118394]
We introduce the first-ever vast resource for training, evaluating, and benchmarking on Indonesian natural language understanding tasks.
IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with different levels of complexity.
The datasets for the tasks lie in different domains and styles to ensure task diversity.
We also provide a set of Indonesian pre-trained models (IndoBERT) trained on Indo4B, a large and clean Indonesian dataset.
arXiv Detail & Related papers (2020-09-11T12:21:41Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)