SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual
Speech Representation
- URL: http://arxiv.org/abs/2205.08180v1
- Date: Tue, 17 May 2022 08:58:48 GMT
- Title: SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual
Speech Representation
- Authors: Sameer Khurana and Antoine Laurent and James Glass
- Abstract summary: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework.
We combine state-of-the-art multilingual acoustic frame-level speech representation learning model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder SAMU-XLSR.
- Score: 11.552745999302905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level
Cross-Lingual Speech Representation learning framework. Unlike previous works
on speech representation learning, which learn multilingual contextual speech
embeddings at the resolution of an acoustic frame (10-20ms), this work focuses
on learning multimodal (speech-text) multilingual speech embeddings at the
resolution of a sentence (5-10s) such that the embedding vector space is
semantically aligned across different languages. We combine state-of-the-art
multilingual acoustic frame-level speech representation learning model XLS-R
with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an
utterance-level multimodal multilingual speech encoder SAMU-XLSR. Although we
train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual
speech-text and speech-speech associations emerge in its learned representation
space. To substantiate our claims, we use SAMU-XLSR speech encoder in
combination with a pre-trained LaBSE text sentence encoder for cross-lingual
speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual
speech-to-speech translation retrieval. We highlight these applications by
performing several cross-lingual text and speech translation retrieval tasks
across several datasets.
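The abstract gives the framework at a high level, so the following is only a minimal sketch of the idea: pool XLS-R's frame-level features into a single utterance vector and train it to match the LaBSE embedding of the paired transcript, so speech and text with the same meaning land close together across languages. The mean pooling, projection size, and cosine-distance loss are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of an utterance-level speech-to-LaBSE alignment objective.
# Assumptions (not from the paper): mean pooling over frames, a single linear
# projection, 1024-dim XLS-R features, 768-dim LaBSE embeddings, cosine loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamuXlsrHead(nn.Module):
    """Maps XLS-R frame features (batch, frames, d_speech) to one
    L2-normalized utterance embedding in the text-embedding space."""
    def __init__(self, d_speech: int = 1024, d_text: int = 768):
        super().__init__()
        self.proj = nn.Linear(d_speech, d_text)

    def forward(self, frame_feats: torch.Tensor, frame_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool over valid frames (attention pooling is another plausible choice).
        mask = frame_mask.unsqueeze(-1).float()                     # (batch, frames, 1)
        pooled = (frame_feats * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return F.normalize(self.proj(pooled), dim=-1)

def alignment_loss(speech_emb: torch.Tensor, labse_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss pulling the speech embedding toward the (frozen)
    LaBSE embedding of the same utterance's transcript."""
    labse_emb = F.normalize(labse_emb, dim=-1)
    return (1.0 - (speech_emb * labse_emb).sum(-1)).mean()

# The retrieval experiments described in the abstract then reduce to cosine
# similarity in the shared space, e.g. for speech-to-text retrieval:
#   scores = speech_embs @ candidate_text_embs.T; best = scores.argmax(-1)
```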
Related papers
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond [36.660499609887886]
Speech-MASSIVE is a multilingual Spoken Language Understanding dataset.
It covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks.
We demonstrate the suitability of Speech-MASSIVE for other tasks such as speech transcription, language identification, and speech translation.
arXiv Detail & Related papers (2024-08-07T16:55:28Z)
- MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation [45.558316325252335]
Multitask Speech Language Model (MSLM) is a decoder-only speech language model trained in a multitask setting.
Our model is able to support multilingual S2ST with speaker style preserved.
arXiv Detail & Related papers (2024-03-19T03:35:20Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both monolingual and multilingual ASR by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between different languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation [19.332953510406327]
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks.
Multilingual speech and text are encoded in a joint fixed-size representation space.
We compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities.
arXiv Detail & Related papers (2022-05-24T17:23:35Z)
- mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representations and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well for IPA transcription of languages seen during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate acoustic model (AM) and language model (LM).
We show that the gain from modeling cross-lingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)