SAN: a robust end-to-end ASR model architecture
- URL: http://arxiv.org/abs/2210.15285v1
- Date: Thu, 27 Oct 2022 09:36:25 GMT
- Title: SAN: a robust end-to-end ASR model architecture
- Authors: Zeping Min, Qian Ge, Guanhua Huang
- Abstract summary: We propose a Siamese Adversarial Network (SAN) architecture for automatic speech recognition.
SAN constructs two sub-networks to differentiate the audio feature input and then introduces a loss to unify the output distribution of these sub-networks.
We conduct numerical experiments with the SAN model on several datasets for the automatic speech recognition task.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel Siamese Adversarial Network (SAN)
architecture for automatic speech recognition, which aims to address the
difficulty of recognizing fuzzy audio. Specifically, SAN constructs two
sub-networks to differentiate the audio feature input and then introduces a
loss to unify the output distributions of these sub-networks. Adversarial
learning enables the network to capture more essential acoustic features and
helps the model achieve better performance when encountering fuzzy audio
input. We conduct numerical experiments with the SAN model on several datasets
for the automatic speech recognition task. All experimental results show that
the siamese adversarial network significantly reduces the character error rate
(CER). Specifically, we achieve a new state-of-the-art CER of 4.37 without a
language model on the AISHELL-1 dataset, a roughly 5% relative CER reduction.
To demonstrate the generality of the siamese adversarial network, we also
conduct experiments on the phoneme recognition task, where it likewise proves
superior.
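The abstract does not spell out the sub-network design or the unifying loss, so the following is only a minimal sketch of the general idea under stated assumptions: the two sub-networks are differentiated by independent dropout masks, the unifying term is a symmetric KL divergence, and all names (`siamese_consistency_loss`, `san_step`, `alpha`) are hypothetical rather than taken from the paper.

```python
import torch.nn.functional as F


def siamese_consistency_loss(logits_a, logits_b):
    """Symmetric KL divergence between the two sub-networks' output
    distributions. The paper only says a loss unifies the outputs;
    symmetric KL is an assumed, common choice for such a term."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)


def san_step(sub_net_a, sub_net_b, feats, targets, asr_criterion, alpha=1.0):
    """One hypothetical training step: each sub-network produces its own
    output distribution over the same audio features (independent dropout
    inside the nets acts as the differentiation here), and the consistency
    term pulls the two distributions together alongside the ASR loss."""
    logits_a = sub_net_a(feats)
    logits_b = sub_net_b(feats)
    loss_asr = asr_criterion(logits_a, targets) + asr_criterion(logits_b, targets)
    return loss_asr + alpha * siamese_consistency_loss(logits_a, logits_b)
```

At inference time one would presumably decode from a single sub-network (or an ensemble of the two); the abstract does not say which.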
Related papers
Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features [2.724035499453558]
We present and apply a novel test to quantify the extent to which the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling supra-segmental temporal (SST) features.
We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST features to any sufficient degree, even when forced to.
arXiv Detail & Related papers (2023-11-01T12:45:31Z)
EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution uses Hidden Markov Model and Gaussian Mixture Model (HMM-GMM) systems along with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
Streaming Speech-to-Confusion Network Speech Recognition [19.720334657478475]
We present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency.
We show that the 1-best results of our model are on par with those of a comparable RNN-T system.
We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
arXiv Detail & Related papers (2023-06-02T20:28:14Z)
Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
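As a rough sketch of the intermediate-layer supervision idea only (the layer indices, the shared prediction head, and the names `ils_style_loss` and `ssl_head` are assumptions for illustration, not ILS-SSL's actual code), the extra SSL loss can be accumulated over selected hidden layers in addition to the top one:

```python
def ils_style_loss(layer_outputs, ssl_head, ssl_criterion, targets,
                   intermediate_layers=(4, 8)):
    """Sum the SSL loss over chosen intermediate layers plus the top layer.
    `layer_outputs` is a list of per-layer hidden states; which layers to
    supervise and whether the head is shared are illustrative assumptions."""
    total = ssl_criterion(ssl_head(layer_outputs[-1]), targets)  # top layer
    for i in intermediate_layers:  # additional SSL loss on intermediate layers
        total = total + ssl_criterion(ssl_head(layer_outputs[i]), targets)
    return total
```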
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
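A loose sketch of how a quadratic term can enrich a layer's representation (this mimics a truncated Taylor expansion with an element-wise second-order term; it is an assumption based on this summary, not the paper's actual operational layer):

```python
import torch.nn as nn


class QuadraticLayer(nn.Module):
    """Hypothetical second-order layer: y = W1 @ x + W2 @ (x * x) + b.
    The paper's quadratic self-organized operational layer may differ."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # first-order (linear) term
        self.quadratic = nn.Linear(in_dim, out_dim, bias=False)  # quadratic term

    def forward(self, x):
        return self.linear(x) + self.quadratic(x * x)
```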
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.