Word Order Does Not Matter For Speech Recognition
- URL: http://arxiv.org/abs/2110.05994v1
- Date: Tue, 12 Oct 2021 13:35:01 GMT
- Title: Word Order Does Not Matter For Speech Recognition
- Authors: Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve and
Ronan Collobert
- Abstract summary: We train a word-level acoustic model which aggregates the distribution of all output frames.
We then train a letter-based acoustic model using Connectionist Temporal Classification loss.
Our system achieves 2.4%/5.3% WER on test-clean/test-other subsets of LibriSpeech.
- Score: 35.96275156733138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study training an automatic speech recognition system in a
weakly supervised setting where the order of words in the transcript labels of the
audio training data is not known. We train a word-level acoustic model which
aggregates the distribution of all output frames using a LogSumExp operation and
uses a cross-entropy loss to match the ground-truth word distribution.
Using the pseudo-labels generated from this model on the training set, we then
train a letter-based acoustic model using the Connectionist Temporal Classification
loss. Our system achieves 2.4%/5.3% WER on test-clean/test-other subsets of
LibriSpeech, which is competitive with the supervised baseline's performance.
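The order-free loss described in the abstract can be sketched as follows. This is a minimal, framework-free illustration (not the authors' code): frame-level word log-posteriors are pooled over time with LogSumExp, renormalized, and matched against the bag-of-words target distribution with cross-entropy. All function and variable names are illustrative.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def pooled_log_probs(frame_log_probs):
    """Pool T x V frame log-probabilities over time into a single
    V-dimensional log-distribution (renormalized after pooling)."""
    vocab_size = len(frame_log_probs[0])
    pooled = [logsumexp([frame[v] for frame in frame_log_probs])
              for v in range(vocab_size)]
    z = logsumexp(pooled)  # renormalize so probabilities sum to 1
    return [p - z for p in pooled]

def bag_of_words_cross_entropy(frame_log_probs, target_dist):
    """Cross-entropy between the pooled model distribution and the
    order-free ground-truth word distribution (word counts / total)."""
    log_q = pooled_log_probs(frame_log_probs)
    return -sum(p * lq for p, lq in zip(target_dist, log_q) if p > 0)

# Toy example: 2 output frames over a 3-word vocabulary.
frame_log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.6), math.log(0.3)],
]
target = [0.5, 0.5, 0.0]  # transcript contains words 0 and 1, order unknown
loss = bag_of_words_cross_entropy(frame_log_probs, target)
```

Because the target is a distribution over words rather than a sequence, the loss is invariant to word order, which is what allows training without ordered transcripts.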
Related papers
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach for Automatic Speech Recognition.
We use a memory-enhanced ASR model from the literature to decode new words from the slides.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Unsupervised Improvement of Audio-Text Cross-Modal Representations [19.960695758478153]
We study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio.
We show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance.
arXiv Detail & Related papers (2023-05-03T02:30:46Z)
- Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting [23.627625026135505]
We propose a novel end-to-end user-defined keyword spotting method.
Our method compares input queries with an enrolled text keyword sequence.
We introduce the LibriPhrase dataset for efficiently training keyword spotting models.
arXiv Detail & Related papers (2022-06-30T16:40:31Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Curriculum optimization for low-resource speech recognition [4.803994937990389]
We propose an automated curriculum learning approach to optimize the sequence of training examples.
We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions.
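The compression-ratio difficulty measure mentioned above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: zlib stands in for whatever compressor the authors use, on the intuition that noisy audio compresses worse than clean, regular audio.

```python
import os
import zlib

def compression_ratio(audio_bytes: bytes) -> float:
    """Compressed size divided by raw size; higher means harder
    (less structure, more noise) under this heuristic."""
    if not audio_bytes:
        return 0.0
    return len(zlib.compress(audio_bytes, level=9)) / len(audio_bytes)

# Clean, highly regular "audio" (silence) vs. white noise.
silence = bytes(16000)      # 1 s of zeros at 16 kHz, 8-bit
noise = os.urandom(16000)   # nearly incompressible

easy_score = compression_ratio(silence)  # near 0
hard_score = compression_ratio(noise)    # near (or slightly above) 1
# A curriculum could then present low-ratio (easier) utterances first.
```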
arXiv Detail & Related papers (2022-02-17T19:47:50Z)
- Wake Word Detection with Alignment-Free Lattice-Free MMI [66.12175350462263]
Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input.
We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data.
We evaluate our methods on two real data sets, showing 50%--90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures.
arXiv Detail & Related papers (2020-05-17T19:22:25Z)
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z) - Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.