Related papers: The IgboAPI Dataset: Empowering Igbo Language Technologies through Multi-dialectal Enrichment

URL: http://arxiv.org/abs/2405.00997v1
Date: Thu, 2 May 2024 04:27:35 GMT
Title: The IgboAPI Dataset: Empowering Igbo Language Technologies through Multi-dialectal Enrichment
Authors: Chris Chinenye Emezue, Ifeoma Okoh, Chinedu Mbonu, Chiamaka Chukwuneke, Daisy Lal, Ignatius Ezeani, Paul Rayson, Ijemma Onwuzulike, Chukwuma Okeke, Gerald Nweya, Bright Ogbonna, Chukwuebuka Oraegbunam, Esther Chidinma Awo-Ndubuisi, Akudo Amarachukwu Osuagwu, Obioha Nmezi,
Abstract summary: The Igbo language is facing a risk of becoming endangered, as indicated by a 2025 UNESCO study. To create robust, impactful, and widely adopted language technologies for Igbo, it is essential to incorporate the multi-dialectal nature of the language. We present the IgboAPI dataset, a multi-dialectal Igbo-English dictionary dataset, developed with the aim of enhancing the representation of Igbo dialects.
Score: 3.087699704782493
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The Igbo language is facing a risk of becoming endangered, as indicated by a 2025 UNESCO study. This highlights the need to develop language technologies for Igbo to foster communication, learning and preservation. To create robust, impactful, and widely adopted language technologies for Igbo, it is essential to incorporate the multi-dialectal nature of the language. The primary obstacle in achieving dialectal-aware language technologies is the lack of comprehensive dialectal datasets. In response, we present the IgboAPI dataset, a multi-dialectal Igbo-English dictionary dataset, developed with the aim of enhancing the representation of Igbo dialects. Furthermore, we illustrate the practicality of the IgboAPI dataset through two distinct studies: one focusing on Igbo semantic lexicon and the other on machine translation. In the semantic lexicon project, we successfully establish an initial Igbo semantic lexicon for the Igbo semantic tagger, while in the machine translation study, we demonstrate that by finetuning existing machine translation systems using the IgboAPI dataset, we significantly improve their ability to handle dialectal variations in sentences.

Related papers

Simultaneous Speech-to-Speech Translation Without Aligned Data [52.467808474293605]
Simultaneous speech translation requires translating source speech into a target language in real-time. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks.
arXiv Detail & Related papers (2026-02-11T17:41:01Z)
WAXAL: A Large-Scale Multilingual African Language Speech Corpus [12.433885475371035]
WAXAL is a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts.
arXiv Detail & Related papers (2026-02-02T19:49:19Z)
Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal [0.6089496237595778]
This paper is a part of a project which aims to develop a trilingual (Toto-Bangla-English) language learning application. It aims to digitally archive and promote the endangered Toto language of West Bengal, India.
arXiv Detail & Related papers (2025-10-26T11:22:46Z)
Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages [27.273651323572786]
BhasaAnuvaad is the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments. Our experiments demonstrate improvements in the translation quality, setting a new standard for Indian language speech translation. We will release all the code, data and model weights in the open-source, with permissive licenses to promote accessibility and collaboration.
arXiv Detail & Related papers (2024-11-07T13:33:34Z)
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features [18.76505158652759]
We propose to exploit both semantic and linguistic features between multiple languages to enhance multilingual translation. On the encoder side, we introduce a disentangling learning task that aligns encoder representations by disentangling semantic and linguistic features. On the decoder side, we leverage a linguistic encoder to integrate low-level linguistic features to assist in the target language generation.
arXiv Detail & Related papers (2024-08-02T17:10:12Z)
Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language. These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
Language Detection for Transliterated Content [0.0]
We study the widespread use of transliteration, where the English alphabet is employed to convey messages in native languages. This paper addresses this challenge through a dataset of phone text messages in Hindi and Russian transliterated into English. The research pioneers innovative approaches to identify and convert transliterated text.
arXiv Detail & Related papers (2024-01-09T15:40:54Z)
Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language. We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
Scaling Back-Translation with Domain Text Generation for Sign Language Gloss Translation [36.40377483258876]
Sign language gloss translation aims to translate the sign glosses into spoken language texts. Back translation (BT) generates pseudo-parallel data by translating in-domain spoken language texts into sign glosses. We propose a Prompt based domain text Generation (PGEN) approach to produce the large-scale spoken language text data.
arXiv Detail & Related papers (2022-10-13T14:25:08Z)
Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding. COD enables dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation [1.1470070927586016]
Out of vocabulary (OOV) is a problem in the context of Machine Translation (MT) in low-resourced languages. This paper focuses on data augmentation techniques where bilingual lexicon terms are expanded based on case-markers.
arXiv Detail & Related papers (2020-11-05T13:58:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.