Chain-based Discriminative Autoencoders for Speech Recognition
- URL: http://arxiv.org/abs/2203.13687v2
- Date: Mon, 28 Mar 2022 12:55:01 GMT
- Title: Chain-based Discriminative Autoencoders for Speech Recognition
- Authors: Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang
- Abstract summary: We propose three new versions of a discriminative autoencoder (DcAE) for speech recognition.
First, a new objective function that considers both categorical cross-entropy and mutual information between ground truth and predicted triphone-state sequences is used.
For application to robust speech recognition, we extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE.
- Score: 16.21321835306968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In our previous work, we proposed a discriminative autoencoder (DcAE) for
speech recognition. DcAE combines two training schemes into one. First, since
DcAE aims to learn encoder-decoder mappings, the squared error between the
reconstructed speech and the input speech is minimized. Second, in the code
layer, frame-based phonetic embeddings are obtained by minimizing the
categorical cross-entropy between ground truth labels and predicted
triphone-state scores. DcAE is developed based on the Kaldi toolkit by treating
various TDNN models as encoders. In this paper, we further propose three new
versions of DcAE. First, a new objective function that considers both
categorical cross-entropy and mutual information between ground truth and
predicted triphone-state sequences is used. The resulting DcAE is called a
chain-based DcAE (c-DcAE). For application to robust speech recognition, we
further extend c-DcAE to hierarchical and parallel structures, resulting in
hc-DcAE and pc-DcAE. In these two models, both the error between the
reconstructed noisy speech and the input noisy speech and the error between the
enhanced speech and the reference clean speech are taken into the objective
function. Experimental results on the WSJ and Aurora-4 corpora show that our
DcAE models outperform baseline systems.
Related papers
- One model to rule them all ? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z) - Mixture Encoder Supporting Continuous Speech Separation for Meeting
Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model
and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Bi-LSTM Scoring Based Similarity Measurement with Agglomerative
Hierarchical Clustering (AHC) for Speaker Diarization [0.0]
A typical conversation between two speakers consists of segments where their voices overlap, interrupt each other or halt their speech in between multiple sentences.
Recent advancements in diarization technology leverage neural network-based approaches to improvise speaker diarization system.
We propose a Bi-directional Long Short-term Memory network for estimating the elements present in the similarity matrix.
arXiv Detail & Related papers (2022-05-19T17:20:51Z) - Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
Cross-domain adapted to the 102.7-hour UASpeech corpus and to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Revisiting joint decoding based multi-talker speech recognition with DNN
acoustic model [34.061441900912136]
We argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
arXiv Detail & Related papers (2021-10-31T09:28:04Z) - A comparison of self-supervised speech representations as input features
for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding ( CPC), autoregressive predictive coding and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z) - Speech Command Recognition in Computationally Constrained Environments
with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
arXiv Detail & Related papers (2020-11-23T14:40:18Z) - Unsupervised feature learning for speech using correspondence and
Siamese networks [24.22616495324351]
We compare two recent methods for frame-level acoustic feature learning.
For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type.
For the correspondence autoencoder (CAE), matching frames are presented as input-output pairs.
For the first time, these feature extractors are compared on the same discrimination tasks using the same weak supervision pairs.
arXiv Detail & Related papers (2020-03-28T14:31:01Z) - Unsupervised Speaker Adaptation using Attention-based Speaker Memory for
End-to-End ASR [61.55606131634891]
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR)
The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism.
We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes
arXiv Detail & Related papers (2020-02-14T18:31:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.