Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition
- URL: http://arxiv.org/abs/2109.09161v1
- Date: Sun, 19 Sep 2021 16:39:22 GMT
- Title: Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition
- Authors: Guolin Zheng, Yubei Xiao, Ke Gong, Pan Zhou, Xiaodan Liang, Liang Lin
- Abstract summary: Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
- Score: 159.9312272042253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unifying acoustic and linguistic representation learning has become
increasingly crucial for transferring the knowledge learned from abundant
high-resource language data to low-resource speech recognition. Existing
approaches simply cascade pre-trained acoustic and language models to learn the
mapping from speech to text. However, how to resolve the representation
discrepancy between speech and text remains unexplored, which hinders the full
utilization of acoustic and linguistic information. Moreover, previous works
simply replace the embedding layer of the pre-trained language model with
acoustic features, which may cause catastrophic forgetting. In this work, we
introduce Wav-BERT, a cooperative acoustic and linguistic representation
learning method to fuse and utilize the contextual information of speech and
text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a
language model (BERT) into an end-to-end trainable framework. A Representation
Aggregation Module is designed to aggregate acoustic and linguistic
representations, and an Embedding Attention Module is introduced to incorporate
acoustic information into BERT, which effectively facilitates the cooperation
of the two pre-trained models and thus boosts representation learning.
Extensive experiments show that our Wav-BERT significantly
outperforms the existing approaches and achieves state-of-the-art performance
on low-resource speech recognition.
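To make the described architecture concrete, below is a minimal sketch of the kind of acoustic-linguistic fusion the abstract outlines, assuming PyTorch and HuggingFace Transformers. The checkpoint names, the 768-dimensional hidden size, and the single cross-attention layer standing in for the Representation Aggregation Module are illustrative assumptions, not the authors' exact design.

```python
# Illustrative sketch only: module names, checkpoints, dimensions, and the
# cross-attention fusion are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class WavBertSketch(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Pre-trained acoustic and language models, kept end-to-end trainable.
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # Hypothetical stand-in for the Representation Aggregation Module:
        # linguistic states attend over acoustic states, then both are fused.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, waveform: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        a = self.acoustic(waveform).last_hidden_state               # (B, T_a, 768) acoustic frames
        l = self.linguistic(input_ids=token_ids).last_hidden_state  # (B, T_l, 768) linguistic tokens
        attended, _ = self.cross_attn(query=l, key=a, value=a)      # text attends to speech
        return self.fuse(torch.cat([l, attended], dim=-1))          # aggregated representation
```

In the paper itself this fused representation would feed a downstream recognition objective; the training losses and the Embedding Attention Module are omitted here for brevity.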
Related papers
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding an audio processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings [19.195728241989702]
We propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of acoustic word embeddings.
We experiment with three languages and demonstrate that incorporating lexical knowledge improves the embedding space discriminability.
arXiv Detail & Related papers (2022-09-14T13:33:04Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.