A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
- URL: http://arxiv.org/abs/2510.13211v1
- Date: Wed, 15 Oct 2025 06:57:23 GMT
- Title: A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
- Authors: Prawaal Sharma, Navneet Goyal, Poonam Goyal, Vishnupriyan R
- Abstract summary: This paper presents a novel scalable and fully automated methodology to extract bilingual parallel corpora from newspaper articles. We validate our approach by building parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream task of machine translation.
- Score: 2.943391000885789
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Linguistic diversity across the world creates a disparity in the availability of good-quality digital language resources, thereby restricting the technological benefits to the majority of the human population. The lack or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel, scalable, and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building a parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream task of machine translation, improving over the current baseline by close to 3 BLEU points.
Related papers
- Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation [3.3393607383304253]
We develop a framework that screens source sentences to form efficient parallel text. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. This approach not only reduces MT systems training cost by reducing training data requirement, but also showcases LALITA's utility in data augmentation.
Date: 2026-01-13T15:05:19Z
- Exploring NLP Benchmarks in an Extremely Low-Resource Setting [21.656551146954587]
This paper focuses on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. We create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data.
Date: 2025-09-04T07:41:23Z
- SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods [1.2091341579150698]
We release datasets of sentences containing polysemous words across ten low-resource languages. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation.
Date: 2025-05-29T17:48:08Z
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
Date: 2024-09-17T08:36:45Z
- Cross-lingual Text Classification Transfer: The Case of Ukrainian [11.508759658889382]
Ukrainian stands as a language that can benefit from the continued refinement of cross-lingual methodologies. To our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage the state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods.
Date: 2024-04-02T15:37:09Z
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
Date: 2023-10-31T13:50:55Z
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
Date: 2023-10-24T23:45:57Z
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
Date: 2022-07-11T07:33:36Z
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively.
Date: 2022-03-17T16:48:22Z
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
Date: 2021-11-02T01:55:17Z
- Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data [40.11208706647032]
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages.
In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data.
Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
Date: 2021-05-31T16:01:18Z