CSTNet: Contrastive Speech Translation Network for Self-Supervised
Speech Representation Learning
- URL: http://arxiv.org/abs/2006.02814v2
- Date: Wed, 5 Aug 2020 07:28:36 GMT
- Title: CSTNet: Contrastive Speech Translation Network for Self-Supervised
Speech Representation Learning
- Authors: Sameer Khurana, Antoine Laurent, James Glass
- Abstract summary: More than half of the 7,000 languages in the world are in imminent danger of going extinct.
It is relatively easy to obtain textual translations corresponding to speech.
We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
- Score: 11.552745999302905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: More than half of the 7,000 languages in the world are in imminent danger of
going extinct. Traditional methods of documenting language proceed by
collecting audio data followed by manual annotation by trained linguists at
different levels of granularity. This time-consuming and painstaking process
could benefit from machine learning. Many endangered languages do not have any
orthographic form but usually have speakers who are bilingual and trained in
a high resource language. It is relatively easy to obtain textual translations
corresponding to speech. In this work, we provide a multimodal machine learning
framework for speech representation learning by exploiting the correlations
between the two modalities, namely speech and its corresponding text
translation. Here, we construct a convolutional neural network audio encoder
capable of extracting linguistic representations from speech. The audio encoder
is trained to perform a speech-translation retrieval task in a contrastive
learning framework. By evaluating the learned representations on a phone
recognition task, we demonstrate that linguistic representations emerge in the
audio encoder's internal representations as a by-product of learning to perform
the retrieval task.
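The speech-translation retrieval objective described in the abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the code below swaps in a symmetric InfoNCE-style loss over a batch of paired embeddings purely as an assumption. The function names (`retrieval_contrastive_loss`, `recall_at_1`), the `temperature` parameter, and the use of pre-computed embedding matrices in place of the convolutional audio encoder and a text encoder are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def retrieval_contrastive_loss(speech_emb, text_emb, temperature=0.1):
    # Assumption: row i of speech_emb and row i of text_emb form a matched
    # (utterance, translation) pair; all other rows in the batch act as negatives.
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    # Speech-to-text retrieval: each row should score its own translation highest.
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-speech retrieval: each column should score its own utterance highest.
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2t + loss_t2s)

def recall_at_1(speech_emb, text_emb):
    # Fraction of utterances whose nearest text embedding is the matched translation.
    sims = l2_normalize(speech_emb) @ l2_normalize(text_emb).T
    return float((sims.argmax(axis=1) == np.arange(len(sims))).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(8, 16))                      # stand-in for audio encoder outputs
    text = speech + 0.1 * rng.normal(size=speech.shape)    # stand-in for translation embeddings
    print("loss:", retrieval_contrastive_loss(speech, text))
    print("recall@1:", recall_at_1(speech, text))
```

In a real system both embedding matrices would come from trainable encoders and the loss would be minimized by gradient descent; the sketch only shows how matched pairs are scored against in-batch negatives.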
Related papers
- Multilingual self-supervised speech representations improve the speech
recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In circumstances with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models with more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
- CLSRIL-23: Cross Lingual Speech Representations for Indic Languages [0.0]
CLSRIL-23 is a self-supervised learning-based model which learns cross-lingual speech representations from raw audio across 23 Indic languages.
It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations.
We compare the language-wise loss during pretraining to assess the effects of monolingual and multilingual pretraining.
arXiv Detail & Related papers (2021-07-15T15:42:43Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model, which aims to improve end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.