Full-Reference Speech Quality Estimation with Attentional Siamese Neural
Networks
- URL: http://arxiv.org/abs/2105.00783v1
- Date: Mon, 3 May 2021 12:38:25 GMT
- Title: Full-Reference Speech Quality Estimation with Attentional Siamese Neural
Networks
- Authors: Gabriel Mittag, Sebastian Möller
- Abstract summary: We present a full-reference speech quality prediction model with a deep learning approach.
The model determines a feature representation of the reference and the degraded signal through a siamese recurrent convolutional network.
The resulting features are then used to align the signals with an attention mechanism and are finally combined to estimate the overall speech quality.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a full-reference speech quality prediction model
with a deep learning approach. The model determines a feature representation of
the reference and the degraded signal through a siamese recurrent convolutional
network that shares the weights for both signals as input. The resulting
features are then used to align the signals with an attention mechanism and are
finally combined to estimate the overall speech quality. The proposed network
architecture represents a simple solution for the time-alignment problem that
occurs for speech signals transmitted through Voice-Over-IP networks and shows
how the clean reference signal can be incorporated into speech quality models
that are based on end-to-end trained neural networks.
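The pipeline described in the abstract (shared-weight encoding of both signals, attention-based alignment, then fusion into one estimate) can be illustrated with a minimal sketch. This is not the authors' model: the single tanh layer standing in for the recurrent convolutional encoder, the dot-product soft attention, the mean absolute difference used for fusion, and all function names (`encode`, `align`, `quality_distance`) are assumptions chosen for illustration. The output is an untrained distance value, not a predicted quality score.

```python
import math
import random


def encode(frames, weights):
    """Shared ("siamese") encoder: the same weights are applied to every frame
    of either signal. Assumption: one tanh layer stands in for the paper's
    recurrent convolutional network."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]
            for frame in frames]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align(deg_feats, ref_feats):
    """For each degraded frame, build a soft-aligned reference frame as an
    attention-weighted sum of all reference frames (dot-product attention,
    an assumption made for this sketch)."""
    aligned = []
    for d in deg_feats:
        scores = [sum(a * b for a, b in zip(d, r)) for r in ref_feats]
        attn = softmax(scores)
        aligned.append([sum(w * r[k] for w, r in zip(attn, ref_feats))
                        for k in range(len(d))])
    return aligned


def quality_distance(ref_frames, deg_frames, weights):
    """Encode both signals with shared weights, align them, and fuse the
    result into a single number (hypothetical fusion: mean absolute
    feature difference, averaged over time)."""
    ref_feats = encode(ref_frames, weights)
    deg_feats = encode(deg_frames, weights)
    aligned = align(deg_feats, ref_feats)
    per_frame = [sum(abs(a - d) for a, d in zip(al, dg)) / len(dg)
                 for al, dg in zip(aligned, deg_feats)]
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    random.seed(0)
    # Random, untrained weights: 4 input features per frame, 3 output features.
    weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
    reference = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
    # Degraded signal: delayed by one frame and perturbed with noise.
    degraded = [reference[0]] + [[x + random.gauss(0, 0.3) for x in f]
                                 for f in reference[:-1]]
    print(quality_distance(reference, degraded, weights))
```

Because the attention step compares every degraded frame with every reference frame, the sketch does not need the two signals to be time-aligned beforehand, which is the property the abstract highlights for Voice-over-IP transmission.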
Related papers
- Speech enhancement with frequency domain auto-regressive modeling [34.55703785405481]
Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation.
We propose a unified framework of speech dereverberation for improving the speech quality and the automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2023-09-24T03:25:51Z)
- Audio-Visual Speech Enhancement with Score-Based Generative Models [22.559617939136505]
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models.
We exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading.
Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality.
arXiv Detail & Related papers (2023-06-02T10:43:42Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and online setup.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Leveraging Symmetrical Convolutional Transformer Networks for Speech to
Singing Voice Style Transfer [49.01417720472321]
We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody.
Experiments are performed on the NUS and NHSS datasets which consist of parallel data of speech and singing voice.
arXiv Detail & Related papers (2022-08-26T02:54:57Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System
Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate the design of a compact audio-visual wake word spotting (WWS) system that utilizes visual information.
We introduce a neural network pruning strategy based on the lottery ticket hypothesis, applied in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Visualising and Explaining Deep Learning Models for Speech Quality
Prediction [0.0]
The non-intrusive speech quality prediction model NISQA is analyzed in this paper.
It is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN).
arXiv Detail & Related papers (2021-12-12T12:50:03Z)
- HASA-net: A non-intrusive hearing-aid speech assessment network [52.83357278948373]
We propose a DNN-based hearing aid speech assessment network (HASA-Net) to predict speech quality and intelligibility scores simultaneously.
To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids.
Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated to two well-known intrusive hearing-aid evaluation metrics.
arXiv Detail & Related papers (2021-11-10T14:10:13Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Adaptation Algorithms for Neural Network-Based Speech Recognition: An
Overview [43.12352697785169]
We present a structured overview of adaptation algorithms for neural network-based speech recognition.
The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation.
We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature.
arXiv Detail & Related papers (2020-08-14T21:50:03Z)
- Sparse Mixture of Local Experts for Efficient Speech Enhancement [19.645016575334786]
We investigate a deep learning approach for speech denoising through an efficient ensemble of specialist neural networks.
By splitting up the speech denoising task into non-overlapping subproblems, we are able to improve denoising performance while also reducing computational complexity.
Our findings demonstrate that a fine-tuned ensemble network is able to exceed the speech denoising capabilities of a generalist network.
arXiv Detail & Related papers (2020-05-16T23:23:22Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.