Related papers: What Do Indonesians Really Need from Language Technology? A Nationwide Survey

What Do Indonesians Really Need from Language Technology? A Nationwide Survey

URL: http://arxiv.org/abs/2506.07506v1
Date: Mon, 09 Jun 2025 07:36:15 GMT
Title: What Do Indonesians Really Need from Language Technology? A Nationwide Survey
Authors: Muhammad Dehan Al Kautsar, Lucky Susanto, Derry Wijaya, Fajri Koto,
Abstract summary: We conduct a nationwide survey to assess the actual needs of native speakers in Indonesia.<n>Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority.
Score: 8.339887237261031
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: There is an emerging effort to develop NLP for Indonesias 700+ local languages, but progress remains costly due to the need for direct engagement with native speakers. However, it is unclear what these language communities truly need from language technology. To address this, we conduct a nationwide survey to assess the actual needs of native speakers in Indonesia. Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority. Although there is strong enthusiasm for advancements in language technology, concerns around privacy, bias, and the use of public data for AI training highlight the need for greater transparency and clear communication to support broader AI adoption.

Related papers

Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo [0.815557531820863]
This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo.<n>Our project employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages.<n>We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets.
arXiv Detail & Related papers (2025-01-19T10:17:21Z)
Real-Time Multilingual Sign Language Processing [4.626189039960495]
Sign Language Processing (SLP) is an interdisciplinary field comprised of Natural Language Processing (NLP) and Computer Vision.<n>Traditional approaches have often been constrained by the use of gloss-based systems that are both language-specific and inadequate for capturing the multidimensional nature of sign language.<n>We propose the use of SignWiring, a universal sign language transcription notation system, to serve as an intermediary link between the visual-gestural modality of signed languages and text-based linguistic representations.
arXiv Detail & Related papers (2024-12-02T21:51:41Z)
LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages. By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.<n>Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.<n>We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation [0.0]
generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation. However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology. Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve their linguistic proficiency.
arXiv Detail & Related papers (2024-04-14T04:25:41Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Towards Bridging the Digital Language Divide [4.234367850767171]
multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden representational preference towards certain languages. We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented. We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any. We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia. Most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
Not always about you: Prioritizing community needs when developing endangered language technology [5.670857685983896]
We discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face. We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics.
arXiv Detail & Related papers (2022-04-12T05:59:39Z)
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.