Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
- URL: http://arxiv.org/abs/2411.12262v4
- Date: Wed, 02 Apr 2025 13:56:42 GMT
- Title: Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
- Authors: Raphael Merx, Adérito José Guterres Correia, Hanna Suominen, Ekaterina Vylomova
- Abstract summary: We propose an observational study on actual usage patterns of tetun.org, a specialized MT service for the Tetun language in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. Our results suggest that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts.
- Score: 7.299910666525873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of tetun.org, a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.
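The analysis described in the abstract is, at its core, an aggregation of request logs along a few dimensions (translation direction, device type, topical domain). A minimal sketch of that kind of aggregation is shown below; it is not the authors' pipeline, and the log field names and example records are assumptions made for illustration, since the paper's actual log schema is not reproduced here.

```python
from collections import Counter

# Hypothetical request records: the tetun.org log schema is not described here, so the
# field names (src_lang, tgt_lang, device, domain) and the values are illustrative only.
requests = [
    {"src_lang": "en", "tgt_lang": "tet", "device": "mobile", "domain": "science"},
    {"src_lang": "pt", "tgt_lang": "tet", "device": "mobile", "domain": "healthcare"},
    {"src_lang": "tet", "tgt_lang": "en", "device": "desktop", "domain": "daily life"},
]

# Translation direction: into Tetun vs. out of Tetun.
directions = Counter(
    "into Tetun" if r["tgt_lang"] == "tet" else "out of Tetun" for r in requests
)

# Device type and topical domain, the other usage dimensions the abstract highlights.
devices = Counter(r["device"] for r in requests)
domains = Counter(r["domain"] for r in requests)

print(directions.most_common())  # e.g. [('into Tetun', 2), ('out of Tetun', 1)]
print(devices.most_common())
print(domains.most_common())
```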
Related papers
- Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
Misinformation transcends linguistic boundaries, posing a challenge for moderation systems.
Most approaches to misinformation detection are monolingual, focused on high-resource languages.
This survey provides a comprehensive overview of the current research on misinformation detection in low-resource languages.
arXiv Detail & Related papers (2024-10-24T03:02:03Z) - Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem [4.830018386227]
This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline.
We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of context retrieved from a constrained database of digitized pedagogical materials and parallel corpora.
arXiv Detail & Related papers (2024-06-21T20:02:22Z) - MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation [61.65537912700187]
Large Language Models (LLMs) have demonstrated strong abilities in machine translation (MT).
We propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner.
arXiv Detail & Related papers (2024-03-14T16:07:39Z) - End-to-End Speech-to-Text Translation: A Survey [0.0]
Speech-to-text translation (ST) is the task of converting speech signals in one language into text in another language.
Automatic Speech Recognition (ASR) and Machine Translation (MT) models play crucial roles in traditional ST systems.
arXiv Detail & Related papers (2023-12-02T07:40:32Z) - Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia [4.634142034755327]
This study comprehensively analyzes training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese.
Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performance.
arXiv Detail & Related papers (2023-11-02T05:27:48Z) - Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z) - Is Translation Helpful? An Empirical Analysis of Cross-Lingual Transfer in Low-Resource Dialog Generation [21.973937517854935]
Cross-lingual transfer is important for developing high-quality chatbots in multiple languages.
In this work, we investigate whether it is helpful to utilize machine translation (MT) at all in this task.
Experiments show that leveraging English dialog corpora can indeed improve the naturalness, relevance and cross-domain transferability in Chinese.
arXiv Detail & Related papers (2023-05-21T15:07:04Z) - Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z) - Dictionary-based Phrase-level Prompting of Large Language Models for Machine Translation [91.57514888410205]
Large language models (LLMs) demonstrate remarkable machine translation (MT) abilities via prompting.
LLMs can struggle to translate inputs with rare words, which are common in low-resource or domain transfer scenarios.
We show that LLM prompting can provide an effective solution for rare words as well, by using prior knowledge from bilingual dictionaries to provide control hints in the prompts.
arXiv Detail & Related papers (2023-02-15T18:46:42Z) - Prompting PaLM for Translation: Assessing Strategies and Performance [16.73524055296411]
The Pathways Language Model (PaLM) has demonstrated the strongest machine translation (MT) performance among similarly trained LLMs to date.
We revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems.
arXiv Detail & Related papers (2022-11-16T18:42:37Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce IGLUE, the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z) - Survey of Low-Resource Machine Translation [65.52755521004794]
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models.
There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available.
arXiv Detail & Related papers (2021-09-01T16:57:58Z) - FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We make quantitative experiments and deep analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z) - Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages [15.859824747983556]
"Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society.
We propose participatory research as a means to involve all necessary agents required in the Machine Translation development process.
arXiv Detail & Related papers (2020-10-05T21:50:38Z) - A Comprehensive Survey of Multilingual Neural Machine Translation [22.96845346423759]
We present a survey on multilingual neural machine translation (MNMT).
MNMT is more promising than its statistical machine translation counterpart because end-to-end modeling and distributed representations open new avenues for research on machine translation.
We first categorize various approaches based on their central use-case and then further categorize them based on resource scenarios, underlying modeling principles, core-issues and challenges.
arXiv Detail & Related papers (2020-01-04T19:38:00Z)