NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian
Languages
- URL: http://arxiv.org/abs/2207.10524v1
- Date: Thu, 21 Jul 2022 15:05:42 GMT
- Title: NusaCrowd: A Call for Open and Reproducible NLP Research in Indonesian
Languages
- Authors: Samuel Cahyawijaya, Alham Fikri Aji, Holy Lovenia, Genta Indra Winata,
Bryan Wilie, Rahmad Mahendra, Fajri Koto, David Moeljadi, Karissa Vincentio,
Ade Romadhony, Ayu Purwarianti
- Abstract summary: NusaCrowd strives to provide the largest crowdsourcing aggregation with standardized data loading for NLP tasks in all Indonesian languages.
By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia.
- Score: 20.051904366350293
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: At the center of the underlying issues that halt Indonesian natural language
processing (NLP) research advancement, we find data scarcity. Resources in
Indonesian languages, especially the local ones, are extremely scarce and
underrepresented. Many Indonesian researchers do not publish their dataset.
Furthermore, the few public datasets that we have are scattered across
different platforms, thus makes performing reproducible and data-centric
research in Indonesian NLP even more arduous. Rising to this challenge, we
initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd
strives to provide the largest datasheets aggregation with standardized data
loading for NLP tasks in all Indonesian languages. By enabling open and
centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle
the data scarcity problem hindering NLP progress in Indonesia and bring NLP
practitioners to move towards collaboration.
Related papers
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Beyond Good Intentions: Reporting the Research Landscape of NLP for
Social Good [115.1507728564964]
We introduce NLP4SG Papers, a scientific dataset with three associated tasks.
These tasks help identify NLP4SG papers and characterize the NLP4SG landscape.
We use state-of-the-art NLP models to address each of these tasks and use them on the entire ACL Anthology.
arXiv Detail & Related papers (2023-05-09T14:16:25Z) - NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z) - NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local
Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z) - One Country, 700+ Languages: NLP Challenges for Underrepresented
Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z) - FedNLP: A Research Platform for Federated Learning in Natural Language
Processing [55.01246123092445]
We present the FedNLP, a research platform for federated learning in NLP.
FedNLP supports various popular task formulations in NLP such as text classification, sequence tagging, question answering, seq2seq generation, and language modeling.
Preliminary experiments with FedNLP reveal that there exists a large performance gap between learning on decentralized and centralized datasets.
arXiv Detail & Related papers (2021-04-18T11:04:49Z) - IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model
for Indonesian NLP [41.57622648924415]
The Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world.
Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization.
We release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM.
arXiv Detail & Related papers (2020-11-02T01:54:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.