Direct Punjabi to English speech translation using discrete units
- URL: http://arxiv.org/abs/2402.15967v1
- Date: Sun, 25 Feb 2024 03:03:34 GMT
- Title: Direct Punjabi to English speech translation using discrete units
- Authors: Prabhjot Kaur, L. Andrew M. Bush, Weisong Shi
- Abstract summary: We present a direct speech-to-speech translation model from Punjabi, an Indic language, to English.
We also explore the use of a discrete representation of speech, called discrete acoustic units, as input to the Transformer-based translation model.
Our results show that the Unit-to-Unit Translation (U2UT) model outperforms the Speech-to-Unit Translation (S2UT) model by 3.69 BLEU points.
- Score: 4.883313216485195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-to-speech translation is yet to reach the same level of coverage as
text-to-text translation systems. The current speech technology is highly
limited in its coverage of over 7000 languages spoken worldwide, leaving more
than half of the population deprived of such technology and shared experiences.
With voice-assisted technology (such as social robots and speech-to-text apps)
and auditory content (such as podcasts and lectures) on the rise, ensuring that
the technology is available for all is more important than ever. Speech
translation can play a vital role in mitigating technological disparity and
creating a more inclusive society. With a motive to contribute towards speech
translation research for low-resource languages, our work presents a direct
speech-to-speech translation model from Punjabi, an Indic language, to
English. Additionally, we explore the performance of using a
discrete representation of speech called discrete acoustic units as input to
the Transformer-based translation model. The model, abbreviated as Unit-to-Unit
Translation (U2UT), takes a sequence of discrete units of the source language
(the language being translated from) and outputs a sequence of discrete units
of the target language (the language being translated to). Our results show
that the U2UT model outperforms the Speech-to-Unit Translation (S2UT)
model by 3.69 BLEU points.
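The abstract does not say how the discrete acoustic units are produced. As a rough illustration only, the sketch below assumes a common recipe from the literature: each frame-level feature vector is mapped to the index of its nearest cluster centroid, and consecutive repeated indices are optionally collapsed. The function names, the toy 2-D "features", the centroids, and the collapsing step are all hypothetical and are not taken from the paper.

```python
from itertools import groupby
from typing import List, Sequence

Vector = Sequence[float]


def nearest_centroid(frame: Vector, centroids: Sequence[Vector]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def speech_to_units(
    frames: Sequence[Vector],
    centroids: Sequence[Vector],
    collapse_repeats: bool = True,
) -> List[int]:
    """Quantize frame-level features into a sequence of discrete unit IDs.

    Assumption: features come from some speech encoder and centroids from a
    clustering step; both are replaced by hand-written toy values here.
    """
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if collapse_repeats:
        # Keep one ID per run of identical consecutive IDs.
        units = [unit for unit, _ in groupby(units)]
    return units


# Toy data (hypothetical): three centroids and six 2-D "frames".
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (1.0, 0.9), (0.1, 0.9), (0.05, -0.1)]

source_units = speech_to_units(frames, centroids)
print(source_units)  # [0, 1, 2, 0]
```

Under these assumptions, a unit-to-unit model would consume a sequence such as `source_units` for the source language and emit an analogous sequence of target-language unit IDs, which a separate vocoder would then turn back into audio.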
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
(2024-05-28T04:11:37Z)
- Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
(2023-12-08T17:18:42Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
(2023-08-22T17:44:18Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
(2023-08-03T15:47:04Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
(2023-06-22T14:37:54Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
(2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.