Pretrained Semantic Speech Embeddings for End-to-End Spoken Language
Understanding via Cross-Modal Teacher-Student Learning
- URL: http://arxiv.org/abs/2007.01836v2
- Date: Tue, 11 Aug 2020 23:32:56 GMT
- Title: Pretrained Semantic Speech Embeddings for End-to-End Spoken Language
Understanding via Cross-Modal Teacher-Student Learning
- Authors: Pavel Denisov, Ngoc Thang Vu
- Abstract summary: We propose a novel training method that enables pretrained contextual embeddings to process acoustic features.
We extend the embedding model with the encoder of a pretrained speech recognition system in order to construct end-to-end spoken language understanding systems.
- Score: 31.7865837105092
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Spoken language understanding is typically based on pipeline architectures
including speech recognition and natural language understanding steps. These
components are optimized independently to allow usage of available data, but
the overall system suffers from error propagation. In this paper, we propose a
novel training method that enables pretrained contextual embeddings to process
acoustic features. In particular, we extend the embedding model with the
encoder of a pretrained speech recognition system in order to construct an
end-to-end spoken language understanding system. Our proposed method is based
on a teacher-student framework across the speech and text modalities that
aligns the acoustic and semantic latent spaces. Experimental results on three
benchmarks show that our system reaches performance comparable to the pipeline
architecture without using any training data, and outperforms it on two of the
three benchmarks after fine-tuning with ten examples per class.
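To make the teacher-student idea concrete, here is a minimal sketch, assuming a frozen text encoder (e.g., BERT applied to transcripts) as the teacher and a toy GRU standing in for the pretrained ASR encoder as the student; all names, dimensions, and the MSE objective are illustrative rather than the authors' exact setup.

```python
import torch
import torch.nn as nn

class SpeechStudent(nn.Module):
    """Toy speech encoder standing in for a pretrained ASR encoder."""
    def __init__(self, n_mels=80, d_model=768):
        super().__init__()
        self.rnn = nn.GRU(n_mels, d_model, batch_first=True)

    def forward(self, feats):            # feats: (B, T, n_mels)
        out, _ = self.rnn(feats)         # (B, T, d_model)
        return out.mean(dim=1)           # utterance-level embedding (B, d_model)

def distillation_loss(student_emb, teacher_emb):
    """Align the acoustic (student) and semantic (teacher) latent spaces.
    MSE is one common choice; the paper's exact objective may differ."""
    return nn.functional.mse_loss(student_emb, teacher_emb)

# Hypothetical training step: teacher embeddings come from a frozen text
# encoder (e.g., BERT) applied to the transcript of the same utterance.
student = SpeechStudent()
feats = torch.randn(4, 200, 80)          # batch of log-mel features
teacher_emb = torch.randn(4, 768)        # frozen text embeddings of transcripts
loss = distillation_loss(student(feats), teacher_emb)
loss.backward()
```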
Related papers
- Learning Disentangled Speech Representations [0.412484724941528]
SynSpeech is a novel large-scale synthetic speech dataset designed to enable research on disentangled speech representations.
We present a framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics.
We find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features like gender and speaking style, while highlighting challenges in isolating complex attributes like speaker identity.
arXiv Detail & Related papers (2023-11-04T04:54:17Z)
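A minimal sketch of the linear-probing evaluation mentioned above, using synthetic stand-ins for frozen SynSpeech representations and factor labels; only the probing recipe itself, freeze the encoder and fit a linear classifier per factor, is taken from the summary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: frozen speech representations and per-factor labels.
reps = np.random.randn(1000, 256)          # embeddings from a frozen encoder
labels = np.random.randint(0, 2, 1000)     # e.g., a binary factor such as gender

X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, test_size=0.2,
                                          random_state=0)

# A linear probe: if a linear model can predict the factor from the frozen
# representation, the factor is linearly decodable, one ingredient of
# disentanglement evaluation.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```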
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
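As a hedged illustration of how pairwise constraints could enter a diarization pipeline, the sketch below injects must-link/cannot-link pairs (assumed to come from an SLU module) into a precomputed segment-affinity matrix before spectral clustering; the paper's joint constraint propagation is more sophisticated than this direct editing.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def apply_pairwise_constraints(affinity, must_link, cannot_link):
    """Inject pairwise constraints into a speaker-similarity matrix.
    A simplification: the paper propagates constraints jointly rather
    than overwriting individual entries."""
    A = affinity.copy()
    for i, j in must_link:                 # same speaker per semantics
        A[i, j] = A[j, i] = 1.0
    for i, j in cannot_link:               # different speakers
        A[i, j] = A[j, i] = 0.0
    return A

# Hypothetical affinity between speech segments, plus semantic constraints.
rng = np.random.default_rng(0)
affinity = rng.uniform(0.0, 1.0, (6, 6))
affinity = (affinity + affinity.T) / 2
np.fill_diagonal(affinity, 1.0)
A = apply_pairwise_constraints(affinity, must_link=[(0, 1)], cannot_link=[(0, 5)])
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)
```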
- Analysis of Joint Speech-Text Embeddings for Semantic Matching [3.6423306784901235]
We study a joint speech-text embedding space trained for semantic matching by minimizing the distance between paired utterance and transcription inputs.
We extend our method to incorporate automatic speech recognition through both pretraining and multitask scenarios.
arXiv Detail & Related papers (2022-04-04T04:50:32Z)
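A minimal sketch of the paired-distance objective described above, with toy encoders and pooled features standing in for the paper's speech and text branches; the cosine-based loss is one plausible choice, not necessarily the exact one used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoders standing in for the paper's speech and text branches; inputs
# are assumed to be pooled utterance/transcription features.
speech_enc = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 128))
text_enc = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 128))

def paired_distance_loss(speech_emb, text_emb):
    # Pull paired utterance/transcription embeddings together; the paper may
    # add negatives, ASR pretraining, or multitask terms on top of this.
    return (1.0 - F.cosine_similarity(speech_emb, text_emb)).mean()

loss = paired_distance_loss(speech_enc(torch.randn(8, 80)),
                            text_enc(torch.randn(8, 300)))
loss.backward()
```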
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
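A rough sketch of unifying wav2vec 2.0 and BERT into one trainable graph, using the Hugging Face transformers API; the linear bridge and the inputs_embeds pathway are illustrative simplifications of Wav-BERT's actual fusion mechanisms.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

# Project wav2vec 2.0 frame features and feed them to BERT as input
# embeddings; Wav-BERT's real bridging is more elaborate than this.
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
bert = BertModel.from_pretrained("bert-base-uncased")
bridge = nn.Linear(wav2vec.config.hidden_size, bert.config.hidden_size)

waveform = torch.randn(1, 16000)                     # 1 second of 16 kHz audio
acoustic = wav2vec(waveform).last_hidden_state       # (1, T', 768)
linguistic = bert(inputs_embeds=bridge(acoustic)).last_hidden_state
print(linguistic.shape)                              # contextualized frames
```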
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information from written text for speech sentiment analysis.
We propose a pseudo-label-based semi-supervised training strategy that uses a language model to train an end-to-end speech sentiment model.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
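A minimal sketch of pseudo-label-based semi-supervised training, assuming a text sentiment classifier (the teacher) labels transcripts of unlabeled speech and confident predictions supervise the end-to-end speech model; thresholds, features, and models are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical components: a text sentiment classifier trained on written
# text (the teacher) and an end-to-end speech sentiment model (the student).
text_clf = nn.Linear(300, 3)      # stands in for a fine-tuned language model
speech_clf = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 3))

def pseudo_label_step(text_feats, speech_feats, threshold=0.5):
    """Label unlabeled speech via its transcript, keep confident predictions,
    and train the speech model on them; the paper's exact filtering rules
    may differ. The low threshold is only for this synthetic demo."""
    with torch.no_grad():
        probs = text_clf(text_feats).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
    keep = conf > threshold                      # confidence filtering
    if keep.any():
        logits = speech_clf(speech_feats[keep])
        return nn.functional.cross_entropy(logits, pseudo[keep])
    return torch.zeros((), requires_grad=True)

loss = pseudo_label_step(torch.randn(16, 300), torch.randn(16, 80))
loss.backward()
```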
- Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs [21.658650440278063]
We propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both.
Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance.
arXiv Detail & Related papers (2021-04-07T20:48:08Z)
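A toy sketch of a classifier with flexible inputs, speech embeddings, ASR-transcript embeddings, or both, where a missing modality is zero-filled; this is an assumption about one simple way to realize the idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FlexibleIntentClassifier(nn.Module):
    """Toy intent classifier accepting a speech embedding, an ASR-transcript
    embedding, or both; a missing modality is replaced by zeros."""
    def __init__(self, d_speech=128, d_text=128, n_intents=10):
        super().__init__()
        self.d_speech, self.d_text = d_speech, d_text
        self.head = nn.Linear(d_speech + d_text, n_intents)

    def forward(self, speech=None, text=None):
        assert speech is not None or text is not None
        n = speech.size(0) if speech is not None else text.size(0)
        s = speech if speech is not None else torch.zeros(n, self.d_speech)
        t = text if text is not None else torch.zeros(n, self.d_text)
        return self.head(torch.cat([s, t], dim=-1))

model = FlexibleIntentClassifier()
print(model(speech=torch.randn(2, 128)).shape)   # speech only
print(model(text=torch.randn(2, 128)).shape)     # ASR transcript only
print(model(speech=torch.randn(2, 128), text=torch.randn(2, 128)).shape)
```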
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on the ATIS and Fluent Speech Commands corpora, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
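A minimal sketch of the fusion idea: concatenate a pooled acoustic embedding with a pooled linguistic embedding and classify intents; dimensions and class count are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical pooled embeddings: an acoustic vector from a pretrained ASR
# encoder and a linguistic vector from a pretrained language model applied
# to the ASR transcript; the dimensions are illustrative.
acoustic_emb = torch.randn(4, 512)
linguistic_emb = torch.randn(4, 768)

# Fuse by concatenation and classify into intents; the paper's fusion and
# classifier details may differ.
n_intents = 20  # illustrative class count
classifier = nn.Sequential(
    nn.Linear(512 + 768, 256), nn.ReLU(), nn.Linear(256, n_intents)
)
logits = classifier(torch.cat([acoustic_emb, linguistic_emb], dim=-1))
print(logits.shape)  # (4, n_intents)
```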
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
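A compact, hedged sketch of a semi-supervised joint pre-training objective in the spirit of SPLAT: per-module self-supervised terms (reduced to placeholders here) plus an alignment term on the small paired subset; SPLAT's real modules are Transformer encoders with masked objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the speech and language modules; masking and reconstruction
# are omitted to keep the sketch short.
speech_module = nn.Linear(80, 256)
language_module = nn.Linear(300, 256)

def joint_pretraining_loss(speech_feats, text_feats, paired_idx):
    """Self-supervised terms for each module (placeholders below) plus an
    alignment term computed only on the small paired speech-text subset."""
    s = speech_module(speech_feats)
    t = language_module(text_feats)
    self_supervised = s.pow(2).mean() + t.pow(2).mean()   # placeholder terms
    alignment = (1 - F.cosine_similarity(s[paired_idx], t[paired_idx])).mean()
    return self_supervised + alignment

loss = joint_pretraining_loss(torch.randn(8, 80), torch.randn(8, 300),
                              paired_idx=torch.tensor([0, 1, 2]))
loss.backward()
```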
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both tasks.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)