Related papers: NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

URL: http://arxiv.org/abs/2205.15960v2
Date: Wed, 12 Apr 2023 16:42:53 GMT
Title: NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Authors: Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder
Abstract summary: We focus on developing resources for languages in Indonesia. Most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
Score: 100.59889279607432
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.

Related papers

Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies [11.52881045684005]
This tutorial is designed for NLP practitioners, researchers, and developers working with multilingual and low-resource languages. Participants will walk away with a practical toolkit for building end-to-end NLP pipelines for underrepresented languages.
arXiv Detail & Related papers (2025-12-16T16:44:17Z)
FormosanBench: Benchmarking Low-Resource Austronesian Languages in the Era of Large Language Models [1.2403152094314245]
We introduce FORMOSANBENCH, the first benchmark for evaluating large language models (LLMs) on low-resource Austronesian languages. We assess model performance in zero-shot, 10-shot, and fine-tuned settings using FORMOSANBENCH. Our results reveal a substantial performance gap between high-resource and Formosan languages.
arXiv Detail & Related papers (2025-06-12T07:02:28Z)
DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives [6.599829213637133]
Indonesia is one of the most diverse countries linguistically. Despite this linguistic diversity, Indonesian languages remain underrepresented in Natural Language Processing research and technologies. We propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build digital language resources in Indonesia.
arXiv Detail & Related papers (2024-11-14T10:00:33Z)
Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any. We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
GlobalBench: A Benchmark for Global Progress in Natural Language Processing [114.24519009839142]
GlobalBench aims to track progress on all NLP datasets in all languages. It also tracks the estimated per-speaker utility and equity of technology across all languages. Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
arXiv Detail & Related papers (2023-05-24T04:36:32Z)
NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)
IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding [41.691861010118394]
We introduce the first-ever vast resource for training, evaluating, and benchmarking on Indonesian natural language understanding tasks. IndoNLU includes twelve tasks, ranging from single-sentence classification to sentence-pair sequence labeling, with different levels of complexity. The datasets for the tasks lie in different domains and styles to ensure task diversity. We also provide a set of Indonesian pre-trained models (IndoBERT) trained on Indo4B, a large and clean Indonesian dataset.
arXiv Detail & Related papers (2020-09-11T12:21:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.