Exploring Turkish Speech Recognition via Hybrid CTC/Attention
Architecture and Multi-feature Fusion Network
- URL: http://arxiv.org/abs/2303.12300v1
- Date: Wed, 22 Mar 2023 04:11:35 GMT
- Title: Exploring Turkish Speech Recognition via Hybrid CTC/Attention
Architecture and Multi-feature Fusion Network
- Authors: Zeyu Ren, Nurmemet Yolwas, Huiru Wang, Wushour Slamu
- Abstract summary: This paper studies a series of speech recognition tuning techniques.
To maximize the use of effective feature information, this paper proposes a new feature extractor, LSPC.
Our model achieves performance comparable to that of advanced End-to-End models.
- Score: 1.514049362441354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, End-to-End speech recognition technology based on deep
learning has developed rapidly. Due to the lack of Turkish speech data, the
performance of Turkish speech recognition systems is poor. Firstly, this paper
studies a series of speech recognition tuning techniques. The results show that
model performance is best when data augmentation combining speed perturbation
with noise addition is adopted and the beam search width is set to 16.
Secondly, to maximize the use of effective feature information and improve the
accuracy of feature extraction, this paper proposes a new feature extractor,
LSPC. LSPC is combined with a LiGRU network to form a shared encoder structure,
which also realizes model compression. The results show that LSPC outperforms
MSPC and VGGnet when only Fbank features are used, improving WER by 1.01% and
2.53%, respectively. Finally, building on these two points, a new
multi-feature fusion network is proposed as the main structure of the encoder.
The results show that the feature fusion network based on LSPC improves WER by
a further 0.82% and 1.94% compared with single-feature extraction using LSPC
(Fbank and Spectrogram features, respectively). Our model achieves performance
comparable to that of advanced End-to-End models.
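As a rough illustration of the paper's two headline ideas, the sketch below wires a two-branch front end (Fbank plus Spectrogram features over the same waveform) into a shared recurrent encoder and trains it with a hybrid CTC/attention objective. All layer sizes, the concatenation-based fusion, and the plain bidirectional GRU standing in for LiGRU are assumptions for illustration; the paper's LSPC extractor is not reproduced here.

```python
# Hedged sketch (PyTorch/torchaudio): multi-feature fusion encoder plus a
# hybrid CTC/attention loss. Dimensions and module choices are illustrative
# assumptions, not the authors' exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

class MultiFeatureEncoder(nn.Module):
    def __init__(self, n_fft=400, n_mels=80, hidden=256):
        super().__init__()
        # Two parallel feature branches computed from the same waveform.
        self.fbank = torchaudio.transforms.MelSpectrogram(n_fft=n_fft, n_mels=n_mels)
        self.spec = torchaudio.transforms.Spectrogram(n_fft=n_fft)
        fused_dim = n_mels + n_fft // 2 + 1            # feature size after concat
        self.rnn = nn.GRU(fused_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)

    def forward(self, wav):                            # wav: (batch, samples)
        fb = self.fbank(wav).transpose(1, 2)           # (batch, frames, n_mels)
        sp = self.spec(wav).transpose(1, 2)            # (batch, frames, n_fft//2+1)
        fused = torch.cat([fb.log1p(), sp.log1p()], dim=-1)
        out, _ = self.rnn(fused)                       # shared encoder states
        return out

def hybrid_loss(ctc_log_probs, att_logits, ctc_targets, att_targets,
                input_lengths, target_lengths, ctc_weight=0.3):
    # Weighted sum of the CTC and attention (cross-entropy) branches.
    # ctc_log_probs: (batch, frames, vocab) log-softmax outputs;
    # att_logits: (batch, tgt_len, vocab); att_targets padded with -1.
    # ctc_weight=0.3 is a common hybrid setting, assumed here.
    ctc = F.ctc_loss(ctc_log_probs.transpose(0, 1), ctc_targets,
                     input_lengths, target_lengths, blank=0)
    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets,
                          ignore_index=-1)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```

In a full hybrid system the CTC branch and the attention decoder share this encoder, and decoding combines both scores during beam search (width 16 in the paper's best-performing configuration).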
Related papers
- Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection [6.367999777464464]
Multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting.
In this paper, we introduce the Straight-through Gumbel-Softmax framework, offering a comprehensive approach to searching multimodal fusion model architectures.
Experiments on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 94.4%, achieved with minimal model parameters.
arXiv Detail & Related papers (2024-06-19T09:26:22Z)
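The straight-through Gumbel-Softmax estimator named above is compact enough to sketch: the forward pass samples a hard one-hot architecture choice, while gradients flow through the soft relaxation. The two candidate fusion operators below are hypothetical stand-ins, not the paper's search space.

```python
# Sketch of a straight-through Gumbel-Softmax choice between two candidate
# fusion operators; the candidates and the learnable logits are illustrative.
import torch
import torch.nn.functional as F

arch_logits = torch.zeros(2, requires_grad=True)     # learnable architecture weights

def searched_fusion(audio_feat, visual_feat):
    # hard=True: one-hot sample on the forward pass, but gradients flow
    # through the soft Gumbel-Softmax relaxation (straight-through trick).
    gate = F.gumbel_softmax(arch_logits, tau=1.0, hard=True)
    candidates = torch.stack([
        audio_feat + visual_feat,                    # candidate 0: additive
        torch.maximum(audio_feat, visual_feat),      # candidate 1: element-wise max
    ])
    return (gate.view(-1, 1, 1) * candidates).sum(dim=0)

fused = searched_fusion(torch.randn(4, 128), torch.randn(4, 128))
```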
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with serialized output training (SOT) to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
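For context, SOT turns a multi-talker mixture into a single target sequence by joining the per-speaker transcripts with a speaker-change token, letting a single-output decoder handle overlapped speech. A minimal sketch, with the token name assumed:

```python
# Sketch of SOT target construction: one serialized transcript per mixture.
# The "<sc>" speaker-change token name is an assumption for illustration.
SC = "<sc>"

def sot_target(transcripts):
    # transcripts: per-speaker strings, conventionally sorted by start time
    return f" {SC} ".join(transcripts)

print(sot_target(["hello there", "how are you"]))
# hello there <sc> how are you
```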
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes the main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
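A single audio-guided cross-modal attention layer of the kind such a fusion encoder stacks can be written with a standard attention module in which audio frames query visual frames; every dimension below is an assumption for illustration.

```python
import torch
import torch.nn as nn

# One audio-guided cross-modal attention layer: audio frames query visual
# frames. Dimensions and the single-layer setup are illustrative only.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio = torch.randn(8, 120, 256)    # (batch, audio frames, dim)
visual = torch.randn(8, 30, 256)    # (batch, video frames, dim)
fused, _ = attn(query=audio, key=visual, value=visual)  # (8, 120, 256)
```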
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities allows speech to be recognized better in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
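The first step above, obtaining a high-level representation from a pre-trained wav2vec model, can be sketched with torchaudio's bundled wav2vec 2.0 checkpoint; the paper's exact wav2vec variant and checkpoint may differ.

```python
import torch
import torchaudio

# Sketch: high-level speech representations from a pre-trained wav2vec 2.0
# model, using torchaudio's bundled base checkpoint as a stand-in.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()
wav = torch.randn(1, bundle.sample_rate)        # one second of (dummy) audio
with torch.inference_mode():
    features, _ = model.extract_features(wav)   # list of per-layer outputs
rep = features[-1]                               # (1, frames, 768)
```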
- Multimodal Fake News Detection via CLIP-Guided Learning [26.093561485807832]
This paper proposes FND-CLIP, a multimodal Fake News Detection network based on Contrastive Language-Image Pretraining (CLIP).
Given a targeted multimodal news item, we extract deep representations from the image and text using a ResNet-based encoder, a BERT-based encoder, and two pair-wise CLIP encoders.
The multimodal feature is a concatenation of the CLIP-generated features weighted by the standardized cross-modal similarity of the two modalities.
arXiv Detail & Related papers (2022-05-28T02:43:18Z)
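The similarity-weighted concatenation described above is straightforward to write down. The sketch below assumes unit-normalized CLIP embeddings and substitutes a sigmoid for whatever standardization the paper applies to the cross-modal similarity.

```python
import torch
import torch.nn.functional as F

def clip_guided_fusion(img_clip, txt_clip):
    # Weight each modality's CLIP feature by the cross-modal similarity of
    # the two modalities, then concatenate. The sigmoid squashing and the
    # shared scalar weight are illustrative assumptions.
    img = F.normalize(img_clip, dim=-1)
    txt = F.normalize(txt_clip, dim=-1)
    sim = (img * txt).sum(dim=-1, keepdim=True)   # cosine similarity
    w = torch.sigmoid(sim)                        # squash to (0, 1)
    return torch.cat([w * img, w * txt], dim=-1)

fused = clip_guided_fusion(torch.randn(4, 512), torch.randn(4, 512))
```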
- Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers [33.725831884078744]
The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach.
We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.
arXiv Detail & Related papers (2021-07-07T04:12:06Z)
- Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition [45.858039215825656]
We propose a new encoder that adopts globally attentive locally recurrent (GALR) networks and directly takes raw waveform as input.
Experiments are conducted on the benchmark dataset AISHELL-2 and two large-scale Mandarin speech corpora of 5,000 hours and 21,000 hours.
arXiv Detail & Related papers (2021-06-08T12:12:33Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
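The recipe amounts to a three-step pipeline: fit a TTS system on the ASR corpus, synthesize speech for additional text, and retrain the recognizer on the union. A data-flow sketch in which train_tts, synthesize, and train_asr are hypothetical stand-ins for real toolkit calls:

```python
# Data-flow sketch only: train_tts, synthesize, and train_asr are hypothetical
# stand-ins for real TTS/ASR toolkit calls, not an actual implementation.
def augment_with_tts(asr_corpus, extra_texts, train_tts, synthesize, train_asr):
    tts = train_tts(asr_corpus)                       # 1. fit TTS on the ASR data
    synthetic = [(synthesize(tts, text), text) for text in extra_texts]  # 2. new pairs
    return train_asr(asr_corpus + synthetic)          # 3. retrain on the union
```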
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio-visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)