ConcealNet: An End-to-end Neural Network for Packet Loss Concealment in
Deep Speech Emotion Recognition
- URL: http://arxiv.org/abs/2005.07777v1
- Date: Fri, 15 May 2020 20:43:02 GMT
- Authors: Mostafa M. Mohamed and Björn W. Schuller
- Abstract summary: Packet loss is a common problem in data transmission, including speech data transmission.
In this paper, we present a concealment wrapper, which can be used with stacked recurrent neural cells.
The proposed ConcealNet model has shown considerable improvement, for both audio reconstruction and the corresponding emotion prediction.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Packet loss is a common problem in data transmission, including speech data
transmission. It may affect a wide range of applications that stream audio data,
such as media streaming or speech emotion recognition (SER). Packet Loss
Concealment (PLC) is any technique for mitigating the effects of packet loss. Simple PLC
baselines are 0-substitution and linear interpolation. In this paper, we present
a concealment wrapper, which can be used with stacked recurrent neural cells.
The concealment cell yields a recurrent neural network (ConcealNet) that
performs real-time step-wise end-to-end PLC at inference time. Additionally,
extending this with an end-to-end emotion prediction neural network yields a
network that performs SER end-to-end from audio with lost frames. The proposed
model is compared against the aforementioned baselines, and a
bidirectional variant with better performance is also utilised. For evaluation, we
chose the public RECOLA dataset, given its long audio tracks with continuous
emotion labels. ConcealNet is evaluated on the reconstruction of the audio and
on the quality of the emotions predicted from the reconstructed audio. The proposed
ConcealNet model has shown considerable improvement, for both audio
reconstruction and the corresponding emotion prediction, in environments where
losses are not of long duration, even when the losses occur frequently.
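The two simple PLC baselines named in the abstract can be sketched in a few lines. This is an illustrative reconstruction of those standard baselines, not the authors' code; the function names and the NumPy formulation are assumptions.

```python
import numpy as np

def zero_substitution(signal: np.ndarray, lost: np.ndarray) -> np.ndarray:
    """0-substitution baseline: replace lost samples with zeros."""
    out = signal.astype(float).copy()
    out[lost] = 0.0
    return out

def linear_interpolation(signal: np.ndarray, lost: np.ndarray) -> np.ndarray:
    """Linear-interpolation baseline: fill each gap by interpolating
    between the nearest received samples on either side."""
    out = signal.astype(float).copy()
    idx = np.arange(len(out))
    received = ~lost
    out[lost] = np.interp(idx[lost], idx[received], out[received])
    return out

# Toy example: a linear ramp with samples 2-4 lost.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
lost = np.array([False, False, True, True, True, False])
print(zero_substitution(x, lost))     # lost span becomes zeros
print(linear_interpolation(x, lost))  # ramp is recovered exactly
```

On a linear signal the interpolation baseline is exact, which is why it is only a baseline: real speech in a gap is not linear, motivating learned concealment such as ConcealNet.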
Related papers
- Speech Enhancement for Virtual Meetings on Cellular Networks [1.487576938041254]
We study speech enhancement using deep learning (DL) for virtual meetings on cellular devices.
We collect a transmitted DNS (t-DNS) dataset using Zoom Meetings over T-Mobile network.
The goal of this project is to enhance the speech transmitted over the cellular networks using deep learning models.
arXiv Detail & Related papers (2023-02-02T04:35:48Z)
- Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture [2.9805017559176883]
This paper extends the existing Res2Net by involving the recent Conformer block to further exploit the local patterns on acoustic features.
Experimental results on ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture is able to improve the spoofing countermeasures performance.
This paper also proposes to re-formulate the existing audio splicing detection problem.
arXiv Detail & Related papers (2022-10-07T14:30:13Z)
- STIP: A SpatioTemporal Information-Preserving and Perception-Augmented Model for High-Resolution Video Prediction [78.129039340528]
We propose a SpatioTemporal Information-Preserving and Perception-Augmented Model (STIP) to solve the above two problems.
The proposed model aims to preserve the spatiotemporal information of videos during feature extraction and state transitions.
Experimental results show that the proposed STIP can predict videos with more satisfactory visual quality compared with a variety of state-of-the-art methods.
arXiv Detail & Related papers (2022-06-09T09:49:04Z)
- Network state Estimation using Raw Video Analysis: vQoS-GAN based non-intrusive Deep Learning Approach [5.8010446129208155]
vQoS GAN can estimate the network state parameters from the degraded received video data.
A robust and unique design of deep learning network model has been trained with the video data along with data rate and packet loss class labels.
The proposed semi supervised generative adversarial network can additionally reconstruct the degraded video data to its original form for a better end user experience.
arXiv Detail & Related papers (2022-03-22T10:42:19Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF)
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications [66.56753488329096]
Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications.
This article describes a technique for predicting lost packet content in real-time using a deep learning approach.
arXiv Detail & Related papers (2020-07-14T15:51:52Z)
- "I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition [0.0]
In applications that use emotion recognition via speech, frame loss can be a severe issue, given the manifold applications affected.
We investigate for the first time the effects of frame-loss on the performance of emotion recognition via speech.
arXiv Detail & Related papers (2020-05-15T19:33:40Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its synchronous/attention extension.
We use the labels as a cue for detecting speech segments with simple thresholding.
arXiv Detail & Related papers (2020-02-03T03:36:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.