Related papers: Cross-Lingual Transfer Learning for Speech Translation

Cross-Lingual Transfer Learning for Speech Translation

URL: http://arxiv.org/abs/2407.01130v1
Date: Mon, 1 Jul 2024 09:51:48 GMT
Title: Cross-Lingual Transfer Learning for Speech Translation
Authors: Rao Ma, Yassir Fathullah, Mengjie Qian, Siyuan Tang, Mark Gales, Kate Knill,
Abstract summary: Zero-shot cross-lingual transfer has been demonstrated on a range of NLP tasks. We explore whether speech-based models exhibit the same transfer capability.
Score: 7.802021866251242
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: There has been increasing interest in building multilingual foundation models for NLP and speech research. Zero-shot cross-lingual transfer has been demonstrated on a range of NLP tasks where a model fine-tuned on task-specific data in one language yields performance gains in other languages. Here, we explore whether speech-based models exhibit the same transfer capability. Using Whisper as an example of a multilingual speech foundation model, we examine the utterance representation generated by the speech encoder. Despite some language-sensitive information being preserved in the audio embedding, words from different languages are mapped to a similar semantic space, as evidenced by a high recall rate in a speech-to-speech retrieval task. Leveraging this shared embedding space, zero-shot cross-lingual transfer is demonstrated in speech translation. When the Whisper model is fine-tuned solely on English-to-Chinese translation data, performance improvements are observed for input utterances in other languages. Additionally, experiments on low-resource languages show that Whisper can perform speech translation for utterances from languages unseen during pre-training by utilizing cross-lingual representations.

Related papers

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion. We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data [44.83532231917504]
TranSentence is a novel speech-to-speech translation without language-parallel speech data. We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder. We extend TranSentence to multilingual speech-to-speech translation.
arXiv Detail & Related papers (2024-01-17T11:52:40Z)
Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages. We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models. It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation. The key idea is to generate source transcript and target translation text with a single decoder. Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct. It is relatively easy to obtain textual translations corresponding to speech. We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
arXiv Detail & Related papers (2020-06-04T12:21:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.