Filter-based Discriminative Autoencoders for Children Speech Recognition
- URL: http://arxiv.org/abs/2204.00164v1
- Date: Fri, 1 Apr 2022 02:18:57 GMT
- Title: Filter-based Discriminative Autoencoders for Children Speech Recognition
- Authors: Chiang-Lin Tai, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang
- Abstract summary: We propose a filter-based discriminative autoencoder for acoustic modeling.
In the training phase, the decoder uses the auxiliary information and the phonetic embedding extracted by the encoder.
The framework can make the phonetic embedding purer, resulting in more accurate senone (triphone-state) scores.
- Score: 25.279902171523233
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Children speech recognition is indispensable but challenging due to the
diversity of children's speech. In this paper, we propose a filter-based
discriminative autoencoder for acoustic modeling. To filter out the influence
of various speaker types and pitches, auxiliary information of the speaker and
pitch features is input into the encoder together with the acoustic features to
generate phonetic embeddings. In the training phase, the decoder uses the
auxiliary information and the phonetic embedding extracted by the encoder to
reconstruct the input acoustic features. The autoencoder is trained by
simultaneously minimizing the ASR loss and feature reconstruction error. The
framework can make the phonetic embedding purer, resulting in more accurate
senone (triphone-state) scores. Evaluated on the test set of the CMU Kids
corpus, our system achieves a 7.8% relative WER reduction compared to the
baseline system. In the domain adaptation experiment, our system also
outperforms the baseline system on the British-accent PF-STAR task.
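The abstract's training recipe (encoder sees acoustic plus speaker/pitch auxiliary features, decoder reconstructs the acoustic input from the phonetic embedding plus the auxiliary features, and the whole model is trained on an ASR loss plus a reconstruction loss) can be sketched as follows. This is a minimal illustrative sketch with numpy, not the paper's implementation: all dimensions, the random linear "networks", and the loss weight `lam` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not from the paper).
D_ACOUSTIC, D_AUX, D_EMB, N_SENONES = 40, 8, 32, 100

# Randomly initialized linear maps stand in for the trained encoder,
# decoder, and senone classifier.
W_enc = rng.normal(scale=0.1, size=(D_ACOUSTIC + D_AUX, D_EMB))
W_dec = rng.normal(scale=0.1, size=(D_EMB + D_AUX, D_ACOUSTIC))
W_asr = rng.normal(scale=0.1, size=(D_EMB, N_SENONES))

def forward(acoustic, aux):
    """Encoder takes acoustic + auxiliary (speaker/pitch) features and emits a
    phonetic embedding; the decoder reconstructs the acoustic input from the
    embedding + auxiliary info; senone scores come from the embedding alone."""
    emb = np.tanh(np.concatenate([acoustic, aux], axis=-1) @ W_enc)
    recon = np.concatenate([emb, aux], axis=-1) @ W_dec
    logits = emb @ W_asr
    return emb, recon, logits

def joint_loss(acoustic, aux, senone_target, lam=0.1):
    """ASR cross-entropy plus weighted reconstruction error.
    The weight lam is a placeholder, not a value from the paper."""
    _, recon, logits = forward(acoustic, aux)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    asr_loss = -log_probs[np.arange(len(senone_target)), senone_target].mean()
    recon_loss = np.mean((recon - acoustic) ** 2)
    return asr_loss + lam * recon_loss

x = rng.normal(size=(5, D_ACOUSTIC))    # 5 frames of acoustic features
a = rng.normal(size=(5, D_AUX))         # speaker/pitch auxiliary features
y = rng.integers(0, N_SENONES, size=5)  # senone (triphone-state) targets
print(joint_loss(x, a, y))
```

Because the auxiliary speaker and pitch information is given to the decoder directly, the encoder is not forced to carry it in the embedding, which is the mechanism by which the embedding becomes "purer" phonetically.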
Related papers
- A Neural Model for Contextual Biasing Score Learning and Filtering [11.862176451777286]
We use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder.
We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors.
Our approach is modular and can be used with any ASR system, and the filtering mechanism can potentially boost performance of other biasing methods.
arXiv Detail & Related papers (2025-10-27T20:41:52Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at similar bit-rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization [60.43992089087448]
Dysarthric speech reconstruction systems aim to automatically convert dysarthric speech into normal-sounding speech.
We propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement.
Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks.
arXiv Detail & Related papers (2024-01-26T06:08:47Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
- Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE [8.144263449781967]
Variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we found a suitable location of VAE's decoder to add a self-attention layer for incorporating non-local information in generating a converted utterance.
arXiv Detail & Related papers (2022-03-30T03:52:42Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech [23.30022534796909]
Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.
We propose approaches to improving accent conversion applicability, as well as quality.
arXiv Detail & Related papers (2020-05-19T08:09:58Z)
- Multi-task self-supervised learning for Robust Speech Recognition [75.11748484288229]
This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments.
We employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances.
We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks.
arXiv Detail & Related papers (2020-01-25T00:24:45Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.