A Comparison of Transformer, Convolutional, and Recurrent Neural
Networks on Phoneme Recognition
- URL: http://arxiv.org/abs/2210.00367v1
- Date: Sat, 1 Oct 2022 20:47:25 GMT
- Title: A Comparison of Transformer, Convolutional, and Recurrent Neural
Networks on Phoneme Recognition
- Authors: Kyuhong Shim, Wonyong Sung
- Abstract summary: We compare and analyze CNN, RNN, Transformer, and Conformer models using phoneme recognition.
Our analyses show that Transformer and Conformer models benefit from the long-range accessibility of self-attention through input frames.
- Score: 16.206467862132012
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Phoneme recognition is a very important part of speech recognition that
requires the ability to extract phonetic features from multiple frames. In this
paper, we compare and analyze CNN, RNN, Transformer, and Conformer models using
phoneme recognition. For CNN, the ContextNet model is used for the experiments.
First, we compare the accuracy of various architectures under different
constraints, such as the receptive field length, parameter size, and layer
depth. Second, we interpret the performance difference of these models,
especially when the observable sequence length varies. Our analyses show that
Transformer and Conformer models benefit from the long-range accessibility of
self-attention through input frames.
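To make the contrast concrete, here is a minimal PyTorch sketch (not from the paper; all sizes are illustrative) of the two access patterns the abstract highlights: a 1-D convolution sees only a fixed window of frames per layer, while a single self-attention layer can attend to every input frame.

```python
import torch
import torch.nn as nn

T, D = 100, 64                      # frames, feature dim
x = torch.randn(1, T, D)            # (batch, time, features)

# CNN: each output frame depends on a fixed receptive field
# (kernel_size frames here; stacking layers grows it only linearly).
conv = nn.Conv1d(D, D, kernel_size=5, padding=2)
y_conv = conv(x.transpose(1, 2)).transpose(1, 2)   # (1, T, D)

# Self-attention: every output frame can attend to all T input
# frames in a single layer, i.e. an unbounded receptive field.
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
y_attn, weights = attn(x, x, x)
print(weights.shape)  # (1, T, T): one score per (query, key) frame pair
```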
Related papers
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
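A toy sketch of the idea, under our own assumptions rather than TVLT's actual configuration: both modalities are linearly embedded into a common token space and processed by one shared (homogeneous) transformer encoder.

```python
import torch
import torch.nn as nn

D = 256
# Illustrative patch embeddings: video patches and audio-spectrogram
# patches are mapped into the same D-dimensional token space.
video_embed = nn.Linear(16 * 16 * 3, D)   # flattened 16x16 RGB patches
audio_embed = nn.Linear(16 * 16, D)       # flattened spectrogram patches

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4,
)

v = video_embed(torch.randn(1, 196, 16 * 16 * 3))  # 196 visual tokens
a = audio_embed(torch.randn(1, 128, 16 * 16))      # 128 audio tokens
tokens = torch.cat([v, a], dim=1)                  # one joint sequence
out = encoder(tokens)                              # (1, 324, D)
```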
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition [11.116921653535226]
We investigate two frameworks that combine a CNN vision backbone with a Transformer to enhance fine-grained action recognition.
Our experimental results show that both our Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality association.
We achieve new state-of-the-art performance on the FineGym benchmark dataset for both proposed architectures.
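A minimal sketch of the general pattern (a small stand-in backbone and illustrative sizes, not the paper's architecture): per-frame CNN features are fed to a Transformer encoder over time.

```python
import torch
import torch.nn as nn

D, T = 512, 16                       # feature dim, clip length in frames

# Per-frame CNN backbone (a tiny stand-in for e.g. a ResNet).
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, D),
)
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(D, 99)        # illustrative class count

frames = torch.randn(1 * T, 3, 112, 112)      # flatten batch and time
feats = backbone(frames).view(1, T, D)        # (batch, time, D)
logits = classifier(temporal(feats).mean(dim=1))
```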
arXiv Detail & Related papers (2022-08-03T08:01:55Z)
- Visualising and Explaining Deep Learning Models for Speech Quality Prediction [0.0]
The non-intrusive speech quality prediction model NISQA is analyzed in this paper.
It is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN).
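A minimal sketch of that CNN-then-RNN pattern (dimensions illustrative, not NISQA's): a CNN embeds per-frame spectrogram patches and an LSTM aggregates them over time into one quality score.

```python
import torch
import torch.nn as nn

# CNN over per-frame spectrogram patches, then an RNN over time.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, 64),
)
rnn = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)               # one MOS-style quality score

T = 50                                # number of spectrogram patches
patches = torch.randn(T, 1, 48, 15)   # (time, channel, mels, width)
feats = cnn(patches).unsqueeze(0)     # (1, T, 64)
_, (h, _) = rnn(feats)
score = head(h[-1])                   # scalar quality estimate
```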
arXiv Detail & Related papers (2021-12-12T12:50:03Z)
- Neural String Edit Distance [77.72325513792981]
We propose the neural string edit distance model for string-pair classification and sequence generation.
We modify the original expectation-maximization learned edit distance algorithm into a differentiable loss function.
We show that we can trade off between performance and interpretability in a single framework.
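The paper's model is richer, but the core idea of making the edit-distance recursion differentiable can be sketched by replacing the hard min with a smooth soft-min so that gradients reach learnable cost parameters (a simplification of the EM-style formulation):

```python
import torch

def soft_min(vals, tau=1.0):
    # Smooth, differentiable stand-in for min (log-sum-exp trick).
    return -tau * torch.logsumexp(-torch.stack(vals) / tau, dim=0)

def soft_edit_distance(a, b, sub_cost, ins_cost, del_cost, tau=0.1):
    # Classic Levenshtein DP with soft-min, so gradients flow to costs.
    m, n = len(a), len(b)
    D = [[torch.tensor(0.0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = torch.tensor(0.0) if a[i - 1] == b[j - 1] else sub_cost
            D[i][j] = soft_min([D[i - 1][j - 1] + match,
                                D[i - 1][j] + del_cost,
                                D[i][j - 1] + ins_cost], tau)
    return D[m][n]

sub = torch.tensor(1.0, requires_grad=True)
ins = torch.tensor(1.0, requires_grad=True)
dele = torch.tensor(1.0, requires_grad=True)
loss = soft_edit_distance("kitten", "sitting", sub, ins, dele)
loss.backward()   # gradients w.r.t. the cost parameters
```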
arXiv Detail & Related papers (2021-04-16T22:16:47Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the visual and audio encoders learn to extract features directly from raw pixels and audio waveforms, respectively.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
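A minimal sketch of the CTC half of such a hybrid objective, using torch.nn.CTCLoss with illustrative shapes (the attention-decoder loss and the 0.3/0.7 weighting in the comment are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

# CTC half of a hybrid CTC/attention objective: encoder outputs are
# scored framewise against the target transcript, and the total loss
# would be a weighted sum with an attention-decoder loss.
V, T, U = 30, 120, 20          # vocab size (blank=0), frames, target length
log_probs = torch.randn(T, 1, V).log_softmax(dim=-1)  # encoder outputs
targets = torch.randint(1, V, (1, U))
ctc = nn.CTCLoss(blank=0)
loss_ctc = ctc(log_probs, targets,
               input_lengths=torch.tensor([T]),
               target_lengths=torch.tensor([U]))
# Hybrid objective (weights illustrative):
# loss = 0.3 * loss_ctc + 0.7 * loss_attention
```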
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
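A simplified sketch of the mechanism (a learned per-offset bias added to the attention logits; not the exact Transformer-XL-style scheme the paper adapts):

```python
import torch
import torch.nn as nn

T, D, max_rel = 50, 64, 100
q = torch.randn(1, T, D)
k = torch.randn(1, T, D)

# One learned embedding per relative offset in [-max_rel, max_rel].
rel_bias = nn.Embedding(2 * max_rel + 1, 1)
offsets = torch.arange(T)[:, None] - torch.arange(T)[None, :]   # (T, T)
offsets = offsets.clamp(-max_rel, max_rel) + max_rel

# Content term (free of absolute positions) plus relative-position term.
logits = q @ k.transpose(1, 2) / D ** 0.5        # (1, T, T)
logits = logits + rel_bias(offsets).squeeze(-1)  # distance-dependent bias
attn = logits.softmax(dim=-1)
```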
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on test-clean/test-other.
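The block structure the paper describes, two half-step feed-forward modules sandwiching self-attention and a convolution module, can be condensed as follows (the real convolution module also has pointwise convolutions, GLU, and batch norm; a single depthwise convolution stands in here):

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Condensed Conformer block: half-FFN -> MHSA -> Conv -> half-FFN."""
    def __init__(self, d=256, heads=4, kernel=31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.norm_attn = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.dwconv = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.norm_out = nn.LayerNorm(d)

    def forward(self, x):                          # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)                 # half-step residual FFN
        h = self.norm_attn(x)
        x = x + self.attn(h, h, h)[0]              # multi-head self-attention
        x = x + self.dwconv(x.transpose(1, 2)).transpose(1, 2)  # conv module
        x = x + 0.5 * self.ffn2(x)                 # half-step residual FFN
        return self.norm_out(x)

out = ConformerBlock()(torch.randn(1, 100, 256))   # (1, 100, 256)
```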
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
- Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection is an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
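For a concrete, much-simplified framing of the task (a framewise binary boundary score rather than the paper's segmental representations and structured loss):

```python
import torch
import torch.nn as nn

# Simplified framewise formulation: a bidirectional RNN over acoustic
# frames emits a per-frame boundary probability.
rnn = nn.LSTM(input_size=39, hidden_size=64, batch_first=True,
              bidirectional=True)
head = nn.Linear(2 * 64, 1)

frames = torch.randn(1, 200, 39)       # e.g. MFCC features per frame
hidden, _ = rnn(frames)
boundary_prob = torch.sigmoid(head(hidden)).squeeze(-1)   # (1, 200)
```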
arXiv Detail & Related papers (2020-02-11T14:03:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.