Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages
- URL: http://arxiv.org/abs/2201.11391v1
- Date: Thu, 27 Jan 2022 09:24:36 GMT
- Title: Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages
- Authors: Jivnesh Sandhan, Ayush Daksh, Om Adideva Paranjay, Laxmidhar Behera and Pawan Goyal
- Abstract summary: Prabhupadavani is a multilingual code-mixed ST dataset for 25 languages.
It contains 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language.
This data can also be used for a code-mixed machine translation task.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, code-mixing has become ubiquitous in Natural Language Processing
(NLP); however, no efforts have been made to address this phenomenon for the Speech
Translation (ST) task. This can be solely attributed to the lack of labelled data
for the code-mixed ST task. Thus, we introduce Prabhupadavani, a multilingual
code-mixed ST dataset for 25 languages, covering ten language families,
containing 94 hours of speech by 130+ speakers, manually aligned with
corresponding text in the target language. To the best of our knowledge,
Prabhupadavani is the first code-mixed ST dataset available in the ST
literature. This data can also be used for a code-mixed machine translation
task. The dataset and code can be accessed at:
https://github.com/frozentoad9/CMST
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification [26.11758147703999]
Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech.
We introduce OffMix-3L, a novel offensive language identification dataset containing code-mixed data from three different languages.
arXiv Detail & Related papers (2023-10-27T09:59:35Z)
- My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks [0.7874708385247353]
We focus on the low-resource Indian language Marathi which lacks any prior work in code-mixing.
We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining.
We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus.
arXiv Detail & Related papers (2023-06-24T18:17:38Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- Gui at MixMT 2022 : English-Hinglish: An MT approach for translation of code mixed data [13.187116325089951]
We try to tackle the same for both English + Hindi to Hinglish and Hinglish to English.
To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation.
arXiv Detail & Related papers (2022-10-21T19:48:18Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling [0.16252563723817934]
We classify code-mixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under the CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)