IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural
Language Understanding
- URL: http://arxiv.org/abs/2009.05387v3
- Date: Thu, 8 Oct 2020 13:11:59 GMT
- Title: IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural
Language Understanding
- Authors: Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel
Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra,
Pascale Fung, Syafri Bahar, Ayu Purwarianti
- Abstract summary: We introduce the first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding tasks.
IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with different levels of complexity.
The datasets for the tasks span different domains and styles to ensure task diversity.
We also provide a set of Indonesian pre-trained models (IndoBERT) trained on Indo4B, a large and clean Indonesian dataset.
- Score: 41.691861010118394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Indonesian is known to be the fourth most frequently used language
on the internet, research progress on this language in natural language
processing (NLP) is slow due to a lack of available resources.
In response, we introduce the first-ever vast resource for training,
evaluating, and benchmarking Indonesian natural language understanding
(IndoNLU) tasks. IndoNLU includes twelve tasks, ranging from single-sentence
classification to sentence-pair sequence labeling, with different levels of
complexity. The datasets for the tasks span different domains and styles to
ensure task diversity. We also provide a set of Indonesian pre-trained models
(IndoBERT) trained on Indo4B, a large and clean Indonesian dataset collected
from publicly available sources such as social media texts, blogs, news, and
websites. We release baseline models for all twelve tasks, as well as the
framework for benchmark evaluation, enabling everyone to benchmark
their systems' performance.
Related papers
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian [0.0]
We construct the first public Indonesian video-text dataset by translating English sentences from the MSVD dataset to Indonesian sentences.
We then train neural network models which were developed for the English video-text dataset on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning.
arXiv Detail & Related papers (2023-06-20T07:19:36Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation [45.90242600586664]
We introduce IndoNLG, the first such benchmark for the Indonesian language for natural language generation (NLG).
We provide a vast and clean pre-training corpus of Indonesian, Sundanese, and Javanese datasets called Indo4B-Plus.
We evaluate the effectiveness and efficiency of IndoBART by conducting extensive evaluation on all IndoNLG tasks.
arXiv Detail & Related papers (2021-04-16T16:16:44Z)
- IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP [41.57622648924415]
The Indonesian language is spoken by almost 200 million people and is the 10th most spoken language in the world.
Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization.
We release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM.
arXiv Detail & Related papers (2020-11-02T01:54:56Z)