SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual
Speech Representation
- URL: http://arxiv.org/abs/2205.08180v1
- Date: Tue, 17 May 2022 08:58:48 GMT
- Title: SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual
Speech Representation
- Authors: Sameer Khurana and Antoine Laurent and James Glass
- Abstract summary: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework.
We combine state-of-the-art multilingual acoustic frame-level speech representation learning model XLS-R with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder SAMU-XLSR.
- Score: 11.552745999302905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level
Cross-Lingual Speech Representation learning framework. Unlike previous works
on speech representation learning, which learns multilingual contextual speech
embedding at the resolution of an acoustic frame (10-20ms), this work focuses
on learning multimodal (speech-text) multilingual speech embedding at the
resolution of a sentence (5-10s) such that the embedding vector space is
semantically aligned across different languages. We combine state-of-the-art
multilingual acoustic frame-level speech representation learning model XLS-R
with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an
utterance-level multimodal multilingual speech encoder SAMU-XLSR. Although we
train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual
speech-text and speech-speech associations emerge in its learned representation
space. To substantiate our claims, we use SAMU-XLSR speech encoder in
combination with a pre-trained LaBSE text sentence encoder for cross-lingual
speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual
speech-to-speech translation retrieval. We highlight these applications by
performing several cross-lingual text and speech translation retrieval tasks
across several datasets.
Related papers
- MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation [45.558316325252335]
Multitask Speech Language Model (MSLM) is a decoder-only speech language model trained in a multitask setting.
Our model is able to support multilingual S2ST with speaker style preserved.
arXiv Detail & Related papers (2024-03-19T03:35:20Z) - Many-to-Many Spoken Language Translation via Unified Speech and Text
Representation Learning with Unit-to-Unit Translation [39.74625363642717]
We represent multilingual speech audio with speech units, the quantized representations of speech features encoded from a self-supervised speech model.
Then, we propose to train an encoder-decoder structured model with a Unit-to-Unit Translation (UTUT) objective on multilingual data.
A single pre-trained model with UTUT can be employed for diverse multilingual speech- and text-related tasks, such as Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - Textless Low-Resource Speech-to-Speech Translation With Unit Language
Models [56.1058530241461]
We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems.
We finetune S2ST as a unit-to-unit seq2seq translation task, and start by pretraining a model on large-scale monolingual speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages in frame-level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z) - T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine
Translation [19.332953510406327]
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks.
Multilingual speech and text are encoded in a joint fixed-size representation space.
We compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities.
arXiv Detail & Related papers (2022-05-24T17:23:35Z) - mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.