An Effective Mixture-Of-Experts Approach For Code-Switching Speech
Recognition Leveraging Encoder Disentanglement
- URL: http://arxiv.org/abs/2402.17189v1
- Date: Tue, 27 Feb 2024 04:08:59 GMT
- Title: An Effective Mixture-Of-Experts Approach For Code-Switching Speech
Recognition Leveraging Encoder Disentanglement
- Authors: Tzu-Ting Yang, Hsin-Wei Wang, Yi-Cheng Wang, Chi-Han Lin, and Berlin
Chen
- Abstract summary: The code-switching phenomenon remains a major obstacle to automatic speech recognition (ASR).
We introduce a novel disentanglement loss that enables the lower layers of the encoder to capture inter-lingual acoustic information while mitigating linguistic confusion at the higher layers.
We verify that our proposed method outperforms prior-art methods that use pretrained dual-encoders.
- Score: 9.28943772676672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of end-to-end (E2E) neural networks, recent years have witnessed unprecedented breakthroughs in automatic speech recognition (ASR). However, the code-switching phenomenon remains a major obstacle that keeps ASR from perfection, as the scarcity of labeled data and the variation between languages often degrade ASR performance. In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to tackle the challenge posed by code-switching. Our main contributions are threefold: First, we introduce a novel disentanglement loss that enables the lower layers of the encoder to capture inter-lingual acoustic information while mitigating linguistic confusion at the higher layers. Second, through comprehensive experiments, we verify that the proposed method outperforms prior-art methods based on pretrained dual-encoders, while having access only to the code-switching corpus and using half as many parameters. Third, the clear differentiation of the encoders' output features further corroborates the complementarity between the disentanglement loss and the mixture-of-experts (MoE) architecture.
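The abstract gives no code, but the two ingredients it names are easy to sketch. Below is a minimal, hypothetical PyTorch rendering: a shared lower encoder feeding two language experts whose outputs are gated and mixed (MoE), plus a disentanglement loss that pushes the experts' higher-layer features apart. The module layout and the cosine-based loss form are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledMoEEncoder(nn.Module):
    """Sketch: shared lower layers + two language experts + a gate."""
    def __init__(self, d=256):
        super().__init__()
        self.lower = nn.GRU(d, d, num_layers=2, batch_first=True)  # inter-lingual acoustics
        self.expert_zh = nn.GRU(d, d, batch_first=True)            # Mandarin expert
        self.expert_en = nn.GRU(d, d, batch_first=True)            # English expert
        self.gate = nn.Linear(d, 2)                                # frame-level mixing weights

    def forward(self, x):                                          # x: (B, T, d)
        shared, _ = self.lower(x)
        h_zh, _ = self.expert_zh(shared)
        h_en, _ = self.expert_en(shared)
        w = torch.softmax(self.gate(shared), dim=-1)               # (B, T, 2)
        mixed = w[..., :1] * h_zh + w[..., 1:] * h_en              # MoE output
        return mixed, h_zh, h_en

def disentanglement_loss(h_zh, h_en):
    """Assumed form: penalize |cosine similarity| between the experts'
    frame-level features so they encode different linguistic content."""
    return F.cosine_similarity(h_zh, h_en, dim=-1).abs().mean()
```

In a full system this term would be added, with a small weight, to the usual CTC/attention ASR objective.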
Related papers
- $ε$-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID).
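As a hedged illustration of "denoising as decoding", the sketch below runs a generic refinement loop in which a noise-prediction network, conditioned on the encoder's latents, is applied iteratively. The `decoder` signature and the crude uniform-step update are assumptions, not the paper's sampler.

```python
import torch

@torch.no_grad()
def denoise_decode(decoder, z, steps=50, shape=(3, 256, 256)):
    """Iteratively refine pure noise into an image, guided by latents z.

    `decoder(x, t, cond=z)` is assumed to predict the remaining noise
    (a DDPM-style parameterization); real samplers use a proper
    variance schedule instead of this toy linear update.
    """
    x = torch.randn(z.size(0), *shape)   # start from pure noise
    for t in reversed(range(steps)):
        eps = decoder(x, t, cond=z)      # predicted noise, conditioned on z
        x = x - eps / steps              # crude per-step noise removal
    return x
```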
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
- Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding [29.80299587861207]
We propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR.
Unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively.
We show that ASCD significantly improves performance by leveraging acoustic and semantic information cooperatively.
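To make "cooperative" concrete, here is a hypothetical decoder layer that attends to acoustic and semantic features within the same block rather than in two separate stages; the layer layout is assumed for illustration and omits details such as layer normalization.

```python
import torch.nn as nn

class CooperativeDecoderLayer(nn.Module):
    """Illustrative only: one layer that attends to acoustic and
    semantic features jointly instead of in two separate stages."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ac_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.sem_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, y, acoustic, semantic):
        y = y + self.self_attn(y, y, y)[0]
        # fuse both information sources before the feed-forward block
        y = y + self.ac_attn(y, acoustic, acoustic)[0] \
              + self.sem_attn(y, semantic, semantic)[0]
        return y + self.ff(y)
```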
arXiv Detail & Related papers (2023-05-23T13:25:44Z)
- Challenging Decoder helps in Masked Auto-Encoder Pre-training for Dense Passage Retrieval [10.905033385938982]
The masked auto-encoder (MAE) pre-training architecture has emerged as the most promising approach for dense passage retrieval.
We propose a novel token importance aware masking strategy based on pointwise mutual information to intensify the challenge of the decoder.
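A toy version of importance-aware masking: score each token by how unusually frequent it is in the passage relative to the corpus (a PMI-style ratio), then mask high-scoring tokens more often so the decoder's reconstruction task gets harder. The exact scoring and rescaling below are assumptions, not the paper's formulation.

```python
import math
from collections import Counter

def pmi_mask_probs(doc_tokens, corpus_token_freq, corpus_size, base_rate=0.3):
    """Map each token to a masking probability that grows with its
    PMI-style importance score. Illustrative scaling only."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    scores = {}
    for tok, c in counts.items():
        p_doc = c / n
        p_corpus = corpus_token_freq.get(tok, 1) / corpus_size  # smoothed
        scores[tok] = math.log(p_doc / p_corpus)
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    # rescale scores into masking probabilities around the base rate
    return {t: base_rate * (0.5 + (s - lo) / span) for t, s in scores.items()}
```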
arXiv Detail & Related papers (2023-05-22T16:27:10Z)
- Denoising Diffusion Error Correction Codes [92.10654749898927]
Recently, neural decoders have demonstrated their advantage over classical decoding techniques.
Recent state-of-the-art neural decoders suffer from high complexity and lack the important iterative scheme characteristic of many legacy decoders.
We propose to employ denoising diffusion models for the soft decoding of linear codes at arbitrary block lengths.
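A rough sketch of what soft diffusion decoding of a binary linear code could look like: iteratively subtract predicted channel noise from the received soft values, using the parity-check syndrome as conditioning and as a stopping test. The `denoiser` interface and the BPSK sign convention are assumptions.

```python
import torch

@torch.no_grad()
def diffusion_decode(denoiser, y, H, steps=10):
    """y: (B, n) received soft values (BPSK: bit 0 -> +1, bit 1 -> -1).
    H: (m, n) parity-check matrix over {0, 1}."""
    x = y.clone()
    for _ in range(steps):
        bits = (x < 0).float()            # hard decision per position
        syndrome = (bits @ H.T) % 2       # (B, m); all-zero iff a codeword
        if syndrome.sum() == 0:           # every word decoded: stop early
            break
        x = x - denoiser(x, syndrome)     # one learned denoising step
    return (x < 0).float()
```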
arXiv Detail & Related papers (2022-09-16T11:00:50Z)
- ASR Error Correction with Constrained Decoding on Operation Prediction [8.701142327932484]
We propose an ASR error correction method utilizing the predictions of correction operations.
Experiments on three public datasets demonstrate the effectiveness of the proposed approach in reducing the latency of the decoding process.
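The latency benefit is easiest to see in a sketch: tag each hypothesis token with an edit operation and only re-decode the few positions that actually change. The KEEP/DELETE/CHANGE tag set here is illustrative, not necessarily the paper's.

```python
def apply_operations(tokens, ops, generate):
    """Constrained correction: only positions tagged CHANGE are re-decoded,
    so most of the hypothesis is copied verbatim and latency drops."""
    corrected = []
    for tok, op in zip(tokens, ops):
        if op == "KEEP":
            corrected.append(tok)
        elif op == "CHANGE":
            corrected.append(generate(tok))  # model proposes a replacement
        # op == "DELETE": drop the token entirely
    return corrected
```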
arXiv Detail & Related papers (2022-08-09T09:59:30Z)
- Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural network that can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation.
We tested our approach across different tasks: dimensionality reduction using three different datasets, image compression using the MNIST dataset, and image denoising using Fashion-MNIST.
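One plausible instantiation of an explicit redundancy penalty (not necessarily the paper's) is to penalize pairwise correlation between bottleneck units so each dimension carries distinct information:

```python
import torch

def redundancy_penalty(z):
    """z: (batch, dim) bottleneck activations. Penalize off-diagonal
    entries of the feature correlation matrix."""
    z = (z - z.mean(0)) / (z.std(0) + 1e-8)    # normalize each feature
    c = (z.T @ z) / z.size(0)                  # (dim, dim) correlations
    off_diag = c - torch.diag(torch.diag(c))   # keep only cross-terms
    return (off_diag ** 2).sum()
```

Added to the reconstruction loss with a small weight, this term discourages duplicate bottleneck features.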
arXiv Detail & Related papers (2022-02-09T18:48:02Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640 ms latency, which, to the best of our knowledge, is state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used to select a better translation from candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
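The selection step can be sketched as follows: the NAR decoder emits one hypothesis per candidate target length, and the AR decoder, sharing the same speech encoder, scores each so the best length-normalized score wins. The decoder APIs here are hypothetical.

```python
import torch

@torch.no_grad()
def select_candidate(nar_decoder, ar_decoder, enc_out, lengths):
    """Orthros-style selection sketch over a beam of target lengths."""
    candidates = [nar_decoder(enc_out, length) for length in lengths]
    # length-normalized AR log-probability as the selection score
    scores = [ar_decoder.log_prob(c, enc_out) / max(len(c), 1)
              for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```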
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
- Class-Conditional Defense GAN Against End-to-End Speech Attacks [82.21746840893658]
We propose a novel approach against end-to-end adversarial attacks developed to fool advanced speech-to-text systems such as DeepSpeech and Lingvo.
Unlike conventional defense approaches, the proposed approach does not directly employ low-level transformations such as autoencoding a given input signal.
Our defense-GAN considerably outperforms conventional defense algorithms in terms of word error rate and sentence-level recognition accuracy.
arXiv Detail & Related papers (2020-10-22T00:02:02Z)
- Infomax Neural Joint Source-Channel Coding via Adversarial Bit Flip [41.28049430114734]
We propose a novel regularization method called Infomax Adversarial-Bit-Flip (IABF) to improve the stability and robustness of the neural joint source-channel coding scheme.
Our IABF can achieve state-of-the-art performances on both compression and error correction benchmarks and outperform the baselines by a significant margin.
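An adversarial bit flip can be sketched by brute force: find the single bit in the latent code whose flip hurts reconstruction most, and train the decoder on that corrupted code. This is an illustrative stand-in; the paper's actual procedure may differ.

```python
def adversarial_bit_flip(bits, decoder, target, loss_fn):
    """bits: binary latent code tensor of shape (..., n_bits).
    Returns the single-bit corruption that maximizes the loss."""
    worst_bits, worst_loss = bits, float("-inf")
    for i in range(bits.size(-1)):
        flipped = bits.clone()
        flipped[..., i] = 1 - flipped[..., i]          # flip position i
        loss = loss_fn(decoder(flipped), target).item()
        if loss > worst_loss:
            worst_bits, worst_loss = flipped, loss
    return worst_bits   # decoder trains against the most damaging flip
```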
arXiv Detail & Related papers (2020-04-03T10:00:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.