Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings
- URL: http://arxiv.org/abs/2210.12857v1
- Date: Sun, 23 Oct 2022 21:16:09 GMT
- Title: Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings
- Authors: Jian Zhu, Zuoyu Tian, Yadong Liu, Cong Zhang, Chia-wen Lo
- Abstract summary: This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inducing semantic representations directly from speech signals is a highly
challenging task but has many useful applications in speech mining and spoken
language understanding. This study tackles the unsupervised learning of
semantic representations for spoken utterances. By converting speech
signals into hidden units generated from acoustic unit discovery, we propose
WavEmbed, a multimodal sequential autoencoder that predicts hidden units from a
dense representation of speech. Second, we propose S-HuBERT to induce
meaning through knowledge distillation, in which a sentence embedding model is
first trained on hidden units and passes its knowledge to a speech encoder
through contrastive learning. The best-performing model achieves a moderate correlation (0.5-0.6) with human judgments, without relying on any labels or
transcriptions. Furthermore, these models can be easily extended to
leverage textual transcriptions of speech to learn much better speech
embeddings that are strongly correlated with human annotations. Our proposed
methods are applicable to the development of purely data-driven systems for
speech mining, indexing, and search.
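
To make the WavEmbed idea concrete, below is a minimal sketch, not the authors' implementation: a recurrent encoder pools acoustic frame features into one dense sentence embedding, and an autoregressive decoder reconstructs the utterance's discrete hidden units from that embedding. The GRU choice, dimensions, and the `WavEmbedSketch` name are illustrative assumptions.

```python
# Hypothetical sketch of a WavEmbed-style sequential autoencoder;
# architecture details are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

class WavEmbedSketch(nn.Module):
    def __init__(self, feat_dim=768, embed_dim=512, n_units=100):
        super().__init__()
        self.n_units = n_units
        self.encoder = nn.GRU(feat_dim, embed_dim, batch_first=True)
        self.unit_emb = nn.Embedding(n_units + 1, embed_dim)  # index n_units = BOS
        self.decoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, n_units)

    def forward(self, frames, units):
        # frames: (B, T, feat_dim) acoustic frame features
        # units:  (B, U) discrete hidden-unit IDs in [0, n_units)
        _, h = self.encoder(frames)                  # h: (1, B, embed_dim)
        sent_emb = h[-1]                             # the dense sentence embedding
        bos = torch.full_like(units[:, :1], self.n_units)
        dec_in = self.unit_emb(torch.cat([bos, units[:, :-1]], dim=1))
        dec_out, _ = self.decoder(dec_in, sent_emb.unsqueeze(0))
        logits = self.out(dec_out)                   # (B, U, n_units)
        return sent_emb, logits

model = WavEmbedSketch()
frames = torch.randn(2, 300, 768)       # e.g. frame features from a speech encoder
units = torch.randint(0, 100, (2, 50))  # e.g. k-means cluster IDs of those frames
sent_emb, logits = model(frames, units)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), units)
```

After training, `sent_emb` would serve as the spoken sentence embedding; the decoder exists only to force it to carry enough information to reconstruct the hidden-unit sequence.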
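Similarly, here is a minimal sketch of the contrastive distillation step described for S-HuBERT, assuming a frozen teacher that embeds hidden-unit sequences and a student speech encoder: matched pairs are positives, other in-batch items are negatives. The symmetric InfoNCE form and the temperature value are assumptions.

```python
# Hypothetical sketch of an S-HuBERT-style contrastive distillation loss;
# the exact objective in the paper may differ.
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.05):
    """student_emb, teacher_emb: (B, D); row i of each is the same utterance."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                     # (B, B) scaled cosine sims
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetric InfoNCE: each speech embedding must retrieve its own teacher
    # embedding among in-batch negatives, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```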
Related papers
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
Speech signals, typically sampled at tens of thousands of samples per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Methods such as de-duplication and subword modeling can further compress the speech sequence length (a minimal de-duplication sketch appears after this list).
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information, from which pairwise constraints are derived.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Instruction-Following Speech Recognition
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
Remarkably, our model, trained from scratch on LibriSpeech, interprets and executes simple instructions without requiring large language models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech-phoneme pairs, achieving minimally supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Introducing Semantics into Speech Encoders
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- ESSumm: Extractive Speech Summarization from Untranscribed Meeting
We propose a novel architecture for direct extractive speech-to-speech summarization, ESSumm.
We leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio.
Our approach automatically predicts the optimal sequence of speech segments that captures the key information under a target summary length.
arXiv Detail & Related papers (2022-09-14T20:13:15Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process can stand on its own or be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Learning De-identified Representations of Prosody from Raw Audio
We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal.
We exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations.
arXiv Detail & Related papers (2021-07-17T14:37:25Z)
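
As referenced in the first related paper above, here is a minimal sketch of unit de-duplication, one of the compression steps that study compares. Subword modeling (e.g. BPE over the de-duplicated units) would follow as a separate step; this illustration covers only the run-length collapse.

```python
# Illustrative only: run-length de-duplication of a discrete-unit sequence.
from itertools import groupby

def deduplicate(units):
    """Collapse consecutive repeats: [5, 5, 5, 9, 9, 5] -> [5, 9, 5]."""
    return [u for u, _ in groupby(units)]

assert deduplicate([5, 5, 5, 9, 9, 5]) == [5, 9, 5]
```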