"I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion
Recognition
- URL: http://arxiv.org/abs/2005.07757v1
- Date: Fri, 15 May 2020 19:33:40 GMT
- Title: "I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion
Recognition
- Authors: Mostafa M. Mohamed and Björn W. Schuller
- Abstract summary: In applications that use emotion recognition via speech, frame-loss can be a severe issue, as the audio stream may lose data frames for reasons such as low bandwidth.
We investigate for the first time the effects of frame-loss on the performance of emotion recognition via speech.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frame-loss can be a severe issue in the manifold applications that use
emotion recognition via speech, where the audio stream loses some data frames
for a variety of reasons, such as low bandwidth. In this contribution,
we investigate for the first time the effects of frame-loss on the performance
of emotion recognition via speech. Reproducible extensive experiments are
reported on the popular RECOLA corpus using a state-of-the-art end-to-end deep
neural network, which mainly consists of convolution blocks and recurrent
layers. A simple environment based on a Markov Chain model is used to model the
loss mechanism based on two main parameters. We explore matched, mismatched,
and multi-condition training settings. As one expects, the matched setting
yields the best performance, while the mismatched setting yields the lowest.
Furthermore, frame-loss as a data augmentation technique is introduced as a
general-purpose strategy to overcome the effects of frame-loss. It can be used
during training, and we observed it to produce models that are more robust
against frame-loss in run-time environments.
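The abstract describes the loss environment only at a high level: a Markov-chain model driven by two parameters, later reused as training-time augmentation. The sketch below is a minimal illustration under a common two-state (Gilbert-Elliott-style) assumption, where one parameter governs the chance of entering a loss burst and the other the chance of leaving it; the function names, parameter names, and the zero-filling of lost frames are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def markov_loss_mask(num_frames, p_loss, p_recover, rng=None):
    """Sample a frame-loss pattern from a two-state Markov chain.

    Two illustrative parameters (names are assumptions, not the paper's):
      p_loss    -- probability that a received frame is followed by a lost one
      p_recover -- probability of leaving a loss burst at the next frame
    Returns a boolean mask: True = frame received, False = frame lost.
    """
    rng = rng if rng is not None else np.random.default_rng()
    mask = np.ones(num_frames, dtype=bool)
    lost = False  # start in the "received" state
    for t in range(num_frames):
        if lost:
            lost = rng.random() >= p_recover  # stay in the loss burst or recover
        else:
            lost = rng.random() < p_loss      # enter a loss burst
        mask[t] = not lost
    return mask

def frame_loss_augment(frames, p_loss=0.1, p_recover=0.5, fill_value=0.0):
    """Frame-loss as data augmentation: blank out the lost frames.

    `frames` is a (num_frames, feature_dim) array; lost frames are replaced
    by `fill_value` so the sequence length stays fixed for batched training.
    """
    mask = markov_loss_mask(len(frames), p_loss, p_recover)
    corrupted = frames.copy()
    corrupted[~mask] = fill_value
    return corrupted

# Example: corrupt one 100-frame utterance before feeding it to the network.
utterance = np.random.randn(100, 40)        # e.g. 100 frames of 40-dim features
corrupted = frame_loss_augment(utterance)   # training-time augmentation
```

Under this sketch, matched training would use the same (p_loss, p_recover) pair at training and test time, mismatched training a different pair, and multi-condition training would resample the pair per utterance; this mapping is an illustrative reading of the three settings named in the abstract, not the authors' exact protocol.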
Related papers
- Perception-Oriented Video Frame Interpolation via Asymmetric Blending [20.0024308216849]
Previous methods for Video Frame Interpolation (VFI) have encountered challenges, notably the manifestation of blur and ghosting effects.
We propose PerVFI (Perception-oriented Video Frame Interpolation) to mitigate these challenges.
Experimental results validate the superiority of PerVFI, demonstrating significant improvements in perceptual quality compared to existing methods.
arXiv Detail & Related papers (2024-04-10T02:40:17Z)
- FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores.
arXiv Detail & Related papers (2023-11-22T00:26:15Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Frame Flexible Network [52.623337134518835]
Existing video recognition algorithms always conduct different training pipelines for inputs with different frame numbers.
If we evaluate the model using other frames which are not used in training, we observe the performance will drop significantly.
We propose a general framework, named Frame Flexible Network (FFN), which enables the model to be evaluated at different frames to adjust its computation.
arXiv Detail & Related papers (2023-03-26T20:51:35Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- YOLOV: Making Still Image Object Detectors Great at Video Object Detection [23.039968987772543]
Video object detection (VID) is challenging because of the high variation of object appearance and the diverse deterioration in some frames.
This work proposes a simple yet effective strategy to address the concerns, which spends marginal overheads with significant gains in accuracy.
Our YOLOX-based model can achieve promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU).
arXiv Detail & Related papers (2022-08-20T14:12:06Z)
- NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves the state-of-the-art accuracy-efficiency trade-off and presents a significantly faster (2.4x-4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z)
- Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that captures temporal differences between the current and previous frame.
arXiv Detail & Related papers (2022-06-20T07:20:02Z)
- SAFL: A Self-Attention Scene Text Recognizer with Focal Loss [4.462730814123762]
Scene text recognition remains challenging due to inherent problems such as distortions or irregular layout.
Most of the existing approaches mainly leverage recurrence or convolution-based neural networks.
We introduce SAFL, a self-attention-based neural network model with the focal loss for scene text recognition.
arXiv Detail & Related papers (2022-01-01T06:51:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.