Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
- URL: http://arxiv.org/abs/2411.12262v2
- Date: Thu, 19 Dec 2024 07:29:23 GMT
- Title: Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
- Authors: Raphael Merx, Adérito José Guterres Correia, Hanna Suominen, Ekaterina Vylomova
- Abstract summary: We propose an observational study on actual usage patterns of a specialized MT service for the Tetun language in Timor-Leste.
Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora.
Our results suggest that MT systems for minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts.
- Score: 7.299910666525873
- License:
- Abstract: Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.
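As a rough illustration of the kind of log analysis the abstract describes, the sketch below aggregates translation requests by direction, domain, and device. The field names, keyword lists, and keyword-based domain labelling are hypothetical assumptions for the example, not the authors' actual pipeline.

```python
# Hypothetical sketch: aggregating MT request logs by translation direction,
# topical domain, and device. Field names (src_lang, tgt_lang, src_text,
# device) and the keyword heuristic are illustrative assumptions only.
from collections import Counter

# Toy keyword lists standing in for a proper domain classifier.
DOMAIN_KEYWORDS = {
    "science": {"cell", "energy", "photosynthesis", "equation"},
    "healthcare": {"doctor", "symptom", "vaccine", "medicine"},
    "daily life": {"market", "family", "weather", "food"},
}

def guess_domain(text: str) -> str:
    """Assign a coarse domain label from keyword overlap."""
    tokens = set(text.lower().split())
    scores = {d: len(tokens & kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

def summarise(requests: list[dict]) -> dict:
    """Count translation directions, domains, and devices over request records."""
    directions = Counter(f"{r['src_lang']}->{r['tgt_lang']}" for r in requests)
    domains = Counter(guess_domain(r["src_text"]) for r in requests)
    devices = Counter(r.get("device", "unknown") for r in requests)
    return {"directions": directions, "domains": domains, "devices": devices}

if __name__ == "__main__":
    sample = [
        {"src_lang": "en", "tgt_lang": "tet",
         "src_text": "The cell stores energy", "device": "mobile"},
        {"src_lang": "en", "tgt_lang": "tet",
         "src_text": "Ask the doctor about the vaccine", "device": "mobile"},
    ]
    print(summarise(sample))
```

An analysis at the scale of 100,000 requests would replace the keyword heuristic with a trained domain classifier and a language identifier; the sketch only shows the shape of the aggregation.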
Related papers
- Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
This survey provides a comprehensive overview of the current research on low-resource language misinformation detection.
We review the existing datasets, methodologies, and tools used in these domains, identifying key challenges related to: data resources, model development, cultural and linguistic context, real-world applications, and research efforts.
Our findings underscore the need for robust, inclusive systems capable of addressing misinformation across diverse linguistic and cultural contexts.
arXiv Detail & Related papers (2024-10-24T03:02:03Z) - Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem [4.830018386227]
This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline.
We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials and parallel corpora.
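As a hypothetical illustration of this retrieval-then-prompt pipeline, the sketch below selects parallel examples by token overlap with the input and assembles a few-shot translation prompt for an LLM. The retrieval scoring and prompt wording are assumptions for the example, not the paper's implementation.

```python
# Hypothetical sketch of retrieval-augmented in-context translation:
# retrieve similar parallel pairs, then build a few-shot prompt.
# Scoring and prompt format are illustrative assumptions.

def retrieve_examples(query: str, parallel_corpus: list[tuple[str, str]], k: int = 3):
    """Return the k corpus pairs whose source side shares the most tokens with the query."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        parallel_corpus,
        key=lambda pair: len(q_tokens & set(pair[0].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, examples: list[tuple[str, str]]) -> str:
    """Assemble an instruction plus in-context examples for an LLM."""
    lines = ["Translate from Southern Quechua to Spanish."]
    for src, tgt in examples:
        lines.append(f"Quechua: {src}\nSpanish: {tgt}")
    lines.append(f"Quechua: {query}\nSpanish:")
    return "\n\n".join(lines)
```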
arXiv Detail & Related papers (2024-06-21T20:02:22Z) - Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia [4.634142034755327]
This study comprehensively analyzes training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese.
Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performance.
arXiv Detail & Related papers (2023-11-02T05:27:48Z) - Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z) - Discourse Centric Evaluation of Machine Translation with a Densely
Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - Survey of Low-Resource Machine Translation [65.52755521004794]
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models.
There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available.
arXiv Detail & Related papers (2021-09-01T16:57:58Z) - FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and in-depth analyses in this new setting, benchmarking the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z) - Participatory Research for Low-resourced Machine Translation: A Case
Study in African Languages [15.859824747983556]
"Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society.
We propose participatory research as a means to involve all necessary agents required in the Machine Translation development process.
arXiv Detail & Related papers (2020-10-05T21:50:38Z) - A Comprehensive Survey of Multilingual Neural Machine Translation [22.96845346423759]
We present a survey on multilingual neural machine translation (MNMT).
MNMT is more promising than its statistical machine translation counterpart because end-to-end modeling and distributed representations open new avenues for research on machine translation.
We first categorize various approaches based on their central use-case and then further categorize them based on resource scenarios, underlying modeling principles, core-issues and challenges.
arXiv Detail & Related papers (2020-01-04T19:38:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.