Visualising and Explaining Deep Learning Models for Speech Quality
Prediction
- URL: http://arxiv.org/abs/2112.06219v1
- Date: Sun, 12 Dec 2021 12:50:03 GMT
- Title: Visualising and Explaining Deep Learning Models for Speech Quality
Prediction
- Authors: H. Tilkorn, G. Mittag (1), S. Möller (1 and 2) ((1) Quality and
Usability Lab, TU Berlin, (2) Language Technology, DFKI Berlin)
- Abstract summary: The non-intrusive speech quality prediction model NISQA is analyzed in this paper.
It is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN).
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Estimating the quality of transmitted speech is known to be a
non-trivial task. Traditionally, test participants were asked to rate the
quality of samples; nowadays, automated methods are available. These methods
can be divided into: 1) intrusive models, which use both the original and the
degraded signal, and 2) non-intrusive models, which only require the degraded
signal. Recently, non-intrusive models based on neural networks have been
shown to outperform signal-processing-based models. However, the advantages
of deep learning based models come at the cost of being more challenging to
interpret. To gain more insight into such prediction models, the
non-intrusive speech quality prediction model NISQA is analyzed in this
paper. NISQA is composed of a convolutional neural network (CNN) and a
recurrent neural network (RNN). The task of the CNN is to compute relevant
features for the speech quality prediction on a frame level, while the RNN
models the time dependencies between the individual speech frames. Different
explanation algorithms are used
to understand the automatically learned features of the CNN. In this way,
several interpretable features could be identified, such as the sensitivity to
noise or strong interruptions. On the other hand, it was found that multiple
features carry redundant information.
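The architecture described here maps onto a compact CNN-to-RNN pipeline. Below
is a minimal PyTorch sketch of that structure, together with a simple
occlusion-style sensitivity probe in the spirit of the explanation analysis;
all module names, layer sizes, and pooling choices are illustrative
assumptions, not the NISQA implementation.

```python
import torch
import torch.nn as nn

class FramewiseCNN(nn.Module):
    """Computes one feature vector per spectrogram frame (illustrative)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis
        )
        self.proj = nn.Linear(16, feat_dim)

    def forward(self, spec):                  # spec: (batch, 1, mels, frames)
        h = self.conv(spec)                   # (batch, 16, 1, frames)
        h = h.squeeze(2).transpose(1, 2)      # (batch, frames, 16)
        return self.proj(h)                   # (batch, frames, feat_dim)

class QualityModel(nn.Module):
    """CNN for per-frame features, RNN for time dependencies, MOS head."""
    def __init__(self, feat_dim=64, hidden=64):
        super().__init__()
        self.cnn = FramewiseCNN(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)      # predicted quality score

    def forward(self, spec):
        feats = self.cnn(spec)
        out, _ = self.rnn(feats)
        return self.head(out.mean(dim=1))     # pool over time

def occlusion_sensitivity(model, spec, width=10):
    """Zero out short time spans and record how the prediction moves --
    one simple explanation technique. Assumes a single spectrogram,
    i.e. batch size one."""
    with torch.no_grad():
        base = model(spec).item()
        deltas = []
        for t in range(0, spec.shape[-1] - width, width):
            occluded = spec.clone()
            occluded[..., t:t + width] = 0.0
            deltas.append(base - model(occluded).item())
    return deltas  # large drops mark time regions the model relies on
```

Zeroing out a burst of frames and watching the predicted score move is one
crude way to surface sensitivities such as the reaction to strong
interruptions mentioned above.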
Related papers
- Benign Overfitting in Two-Layer ReLU Convolutional Neural Networks for
XOR Data [24.86314525762012]
We show that a ReLU CNN trained by gradient descent can achieve near Bayes-optimal accuracy.
Our result demonstrates that CNNs have a remarkable capacity to efficiently learn XOR problems, even in the presence of highly correlated features.
arXiv Detail & Related papers (2023-10-03T11:31:37Z)
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
- Lost Vibration Test Data Recovery Using Convolutional Neural Network: A Case
Study [0.0]
This paper proposes a CNN algorithm for recovering lost vibration data, using the Alamosa Canyon Bridge as a real-structure case study.
Three different CNN models were considered to predict one and two malfunctioning sensors.
The accuracy of the model was increased by adding a convolutional layer.
arXiv Detail & Related papers (2022-04-11T23:24:03Z)
- Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
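To make the combination of phoneme probabilities and a performance measure
concrete, here is a deliberately crude sketch: it scores an utterance by the
mean top-1 phoneme posterior, a hypothetical stand-in for the paper's actual
WER-estimating measure (the function name and toy matrices are invented for
illustration).

```python
import numpy as np

def mean_posterior_confidence(posteriors: np.ndarray) -> float:
    """posteriors: (frames, n_phonemes) matrix of DNN phoneme probabilities.
    Returns a crude intelligibility proxy: the average top-1 posterior.
    (Hypothetical measure; the paper estimates word error rate instead.)"""
    return float(posteriors.max(axis=1).mean())

# Degraded speech tends to flatten phoneme posteriors, lowering the score.
clean = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]])
noisy = np.array([[0.5, 0.3, 0.2], [0.4, 0.35, 0.25]])
assert mean_posterior_confidence(clean) > mean_posterior_confidence(noisy)
```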
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
- Image Quality Assessment using Contrastive Learning [50.265638572116984]
We train a deep Convolutional Neural Network (CNN) using a contrastive pairwise objective to solve an auxiliary prediction task.
We show through extensive experiments that CONTRIQUE achieves competitive performance when compared to state-of-the-art NR image quality models.
Our results suggest that powerful quality representations with perceptual relevance can be obtained without requiring large labeled subjective image quality datasets.
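A contrastive pairwise objective of the kind mentioned here is commonly
implemented as an InfoNCE loss over two augmented views of the same inputs;
the PyTorch sketch below shows that generic form, not the exact CONTRIQUE
formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_pairwise_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss over two augmented views of the same images.
    z1, z2: (batch, dim) embeddings; matching rows are positive pairs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # i-th row matches i-th column
    return F.cross_entropy(logits, targets)
```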
arXiv Detail & Related papers (2021-10-25T21:01:00Z)
- What do End-to-End Speech Models Learn about Speaker, Language and Channel
Information? A Layer-wise and Neuron-level Analysis [16.850888973106706]
We conduct a post-hoc functional interpretability analysis of pretrained speech models using the probing framework.
We analyze utterance-level representations of speech models trained for various tasks such as speaker recognition and dialect identification.
Our results reveal several novel findings, including: i) channel and gender information are distributed across the network, ii) the information is redundantly available in neurons with respect to a task, and iii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network.
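The probing framework referred to above is typically realized by fitting a
lightweight classifier on frozen representations; the scikit-learn sketch
below illustrates that recipe with randomly generated placeholder data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# reps: utterance-level representations from one layer of a frozen speech
# model; labels: the property being probed (e.g., speaker gender).
# Both arrays are hypothetical placeholders.
rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 128))
labels = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# High held-out accuracy suggests the layer encodes the probed property.
print("probe accuracy:", probe.score(X_te, y_te))
```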
arXiv Detail & Related papers (2021-07-01T13:32:55Z)
- On the benefits of robust models in modulation recognition [53.391095789289736]
Deep Neural Networks (DNNs) using convolutional layers are state-of-the-art in many tasks in communications.
In other domains, like image classification, DNNs have been shown to be vulnerable to adversarial perturbations.
We propose a novel framework to test the robustness of current state-of-the-art models.
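Adversarial perturbations like those referenced here are often generated with
the fast gradient sign method (FGSM); the sketch below shows this standard
baseline attack, which is illustrative only and not necessarily the framework
this paper proposes.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.01):
    """Fast gradient sign method: a one-step perturbation that increases
    the classification loss on input x with true label y."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```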
arXiv Detail & Related papers (2021-03-27T19:58:06Z)
- Decentralizing Feature Extraction with Quantum Convolutional Neural Network
for Automatic Speech Recognition [101.69873988328808]
We build upon a quantum convolutional neural network (QCNN) composed of a quantum circuit encoder for feature extraction.
The input speech is first up-streamed to a quantum computing server, where the Mel-spectrogram is extracted.
The corresponding convolutional features are encoded using a quantum circuit algorithm with random parameters.
The encoded features are then down-streamed to the local RNN model for the final recognition.
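The stages of this pipeline read directly as a data flow. The sketch below
mirrors them in plain NumPy, with the quantum circuit encoder replaced by a
random-projection placeholder; every function here is a simplified classical
stand-in, not the QCNN implementation.

```python
import numpy as np

def mel_spectrogram(speech: np.ndarray) -> np.ndarray:
    """Placeholder for server-side Mel-spectrogram extraction."""
    return np.abs(np.fft.rfft(speech.reshape(-1, 256), axis=1))[:, :40]

def quantum_circuit_encoder(features, rng):
    """Stand-in for the random-parameter quantum circuit: a fixed random
    projection, which is all a classical sketch can imitate."""
    proj = rng.normal(size=(features.shape[1], 16))
    return features @ proj

def local_rnn_recognize(encoded):
    """Placeholder for the downstream RNN recognizer."""
    return encoded.mean(axis=0)  # a trained model in practice

rng = np.random.default_rng(0)
speech = rng.normal(size=256 * 100)              # hypothetical waveform
feats = mel_spectrogram(speech)                  # 1) up-stream + extract
encoded = quantum_circuit_encoder(feats, rng)    # 2) quantum encoding
result = local_rnn_recognize(encoded)            # 3) down-stream + recognize
```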
arXiv Detail & Related papers (2020-10-26T03:36:01Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
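The derive-by-stacking step can be pictured as repeating a single searched
cell; the PyTorch sketch below does exactly that with a toy convolutional
cell, since the actual searched operations are specific to the paper.

```python
import torch.nn as nn

def derive_cnn(cell_factory, depth=8, num_classes=10):
    """Builds a CNN by stacking one searched cell `depth` times
    (a toy illustration of deriving an architecture from a cell)."""
    layers = [cell_factory() for _ in range(depth)]
    return nn.Sequential(*layers,
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.LazyLinear(num_classes))

# A stand-in for the searched cell: conv -> batch norm -> ReLU,
# assuming 16-channel inputs throughout.
toy_cell = lambda: nn.Sequential(nn.Conv2d(16, 16, 3, padding=1),
                                 nn.BatchNorm2d(16), nn.ReLU())
model = derive_cnn(toy_cell)
```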
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Inferring Convolutional Neural Networks' accuracies from their architectural
characterizations [0.0]
We study the relationships between a CNN's architecture and its performance.
We show that architectural attributes can be predictive of the networks' performance in two specific computer vision-based physics problems.
We use machine learning models to predict whether a network can perform better than a certain threshold accuracy before training.
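In the same spirit, one can featurize each architecture (depth, filter
counts, parameter count, and so on) and fit a classifier against a threshold
label; the sketch below is a generic version with invented attribute names
and synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical architectural attributes per network:
# [num_layers, num_filters, kernel_size, num_params_millions]
rng = np.random.default_rng(1)
attrs = rng.uniform(size=(300, 4))
# Synthetic label: did the trained network beat the threshold accuracy?
beats = (attrs @ np.array([0.5, 0.3, -0.1, 0.4]) > 0.6).astype(int)

clf = RandomForestClassifier(random_state=1).fit(attrs[:200], beats[:200])
print("held-out accuracy:", clf.score(attrs[200:], beats[200:]))
```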
arXiv Detail & Related papers (2020-01-07T16:41:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.