Multi-Modal Pre-Training for Automated Speech Recognition
- URL: http://arxiv.org/abs/2110.09890v1
- Date: Tue, 12 Oct 2021 17:07:25 GMT
- Title: Multi-Modal Pre-Training for Automated Speech Recognition
- Authors: David M. Chan, Shalini Ghosh, Debmalya Chakrabarty and Bj\"orn
Hoffmeister
- Abstract summary: We introduce a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs.
The resulting method can outperform baseline methods by up to 7% on Librispeech.
- Score: 11.451227239633553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditionally, research in automated speech recognition has focused on
local-first encoding of audio representations to predict the spoken phonemes in
an utterance. Unfortunately, approaches relying on such hyper-local information
tend to be vulnerable to both local-level corruption (such as audio-frame
drops, or loud noises) and global-level noise (such as environmental noise, or
background noise) that has not been seen during training. In this work, we
introduce a novel approach which leverages a self-supervised learning technique
based on masked language modeling to compute a global, multi-modal encoding of
the environment in which the utterance occurs. We then use a new deep-fusion
framework to integrate this global context into a traditional ASR method, and
demonstrate that the resulting method can outperform baseline methods by up to
7% on Librispeech; gains on internal datasets range from 6% (on larger models)
to 45% (on smaller models).
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Improving the Intent Classification accuracy in Noisy Environment [9.447108578893639]
In this paper, we investigate how environmental noise and related noise reduction techniques to address the intent classification task with end-to-end neural models.
For this task, the use of speech enhancement greatly improves the classification accuracy in noisy conditions.
arXiv Detail & Related papers (2023-03-12T06:11:44Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event
Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Cross-domain Voice Activity Detection with Self-Supervised
Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments on an audio signal.
Current state-of-the-art methods focus on training a neural network exploiting features directly contained in the acoustics.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
arXiv Detail & Related papers (2022-09-22T14:53:44Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer
in ASR [13.726142328715897]
We present a method for cross-lingual training an ASR system using absolutely no transcribed training data from the target language.
Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language.
arXiv Detail & Related papers (2021-11-12T16:16:46Z) - A Review of Sound Source Localization with Deep Learning Methods [71.18444724397486]
This article is a review on deep learning methods for single and multiple sound source localization.
We provide an exhaustive topography of the neural-based localization literature in this context.
Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.
arXiv Detail & Related papers (2021-09-08T07:25:39Z) - PILOT: Introducing Transformers for Probabilistic Sound Event
Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Esting the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Multi-Staged Cross-Lingual Acoustic Model Adaption for Robust Speech
Recognition in Real-World Applications -- A Case Study on German Oral History
Interviews [21.47857960919014]
We propose an approach that performs a robust acoustic model adaption to a target domain in a cross-lingual, multi-staged manner.
Our approach enables the exploitation of large-scale training data from other domains in both the same and other languages.
arXiv Detail & Related papers (2020-05-26T08:05:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.