Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals
- URL: http://arxiv.org/abs/2003.03375v1
- Date: Fri, 6 Mar 2020 12:28:04 GMT
- Title: Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals
- Authors: Eric Guizzo, Tillman Weyde, Jack Barnett Leveson
- Abstract summary: We introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing audio data.
We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes.
- Score: 7.219077740523682
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Robustness against temporal variations is important for emotion recognition
from speech audio, since emotion is expressed through complex spectral
patterns that can exhibit significant local dilation and compression on the
time axis depending on speaker and context. To address this and potentially
other tasks, we introduce the multi-time-scale (MTS) method to create
flexibility towards temporal variations when analyzing time-frequency
representations of audio data. MTS extends convolutional neural networks with
convolution kernels that are scaled and re-sampled along the time axis, to
increase temporal flexibility without increasing the number of trainable
parameters compared to standard convolutional layers. We evaluate MTS and
standard convolutional layers in different architectures for emotion
recognition from speech audio, using 4 datasets of different sizes. The results
show that the use of MTS layers consistently improves the generalization of
networks of different capacity and depth, compared to standard convolution,
especially on smaller datasets.
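As a concrete illustration of the idea described in the abstract, below is a minimal PyTorch sketch of an MTS-style layer: a single trainable kernel is re-sampled along the time axis at several scale factors and applied in parallel, so temporal flexibility increases without adding trainable parameters. The class name `MTSConv2d`, the specific scale factors, the bilinear re-sampling, and the element-wise max aggregation across scales are illustrative assumptions of this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTSConv2d(nn.Module):
    """Minimal sketch of a multi-time-scale (MTS) convolution layer.

    One trainable kernel is re-sampled along the time axis at several scale
    factors, so temporal flexibility is gained without adding trainable
    parameters compared to a standard convolution.
    """

    def __init__(self, in_ch, out_ch, kernel_size=(3, 5), scales=(0.5, 1.0, 2.0)):
        super().__init__()
        # Single shared kernel; only this is trained.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding="same")
        self.scales = scales

    def forward(self, x):
        # x: (batch, channels, frequency, time), e.g. a mel spectrogram.
        weight = self.conv.weight  # (out_ch, in_ch, k_freq, k_time)
        outputs = []
        for s in self.scales:
            if s == 1.0:
                w_s = weight
            else:
                k_time = max(1, int(round(weight.shape[-1] * s)))
                # Re-sample the kernel along the time axis only.
                w_s = F.interpolate(weight,
                                    size=(weight.shape[-2], k_time),
                                    mode="bilinear",
                                    align_corners=False)
            outputs.append(F.conv2d(x, w_s, bias=self.conv.bias, padding="same"))
        # Aggregate the parallel scale responses (element-wise max is an
        # assumption of this sketch, not necessarily the paper's choice).
        return torch.stack(outputs, dim=0).max(dim=0).values


# Example: a batch of 8 single-channel spectrograms (128 mel bins, 250 frames).
if __name__ == "__main__":
    layer = MTSConv2d(in_ch=1, out_ch=16)
    spec = torch.randn(8, 1, 128, 250)
    print(layer(spec).shape)  # torch.Size([8, 16, 128, 250])
```

Because every scaled kernel is derived from the same weight tensor, the parameter count matches that of the corresponding standard convolution; only the forward pass performs additional work per scale.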
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition [0.552480439325792]
We focus on the temporal structure of synthetic data and its relation to ASR training.
We show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive TTS.
Using a simple algorithm we shift phoneme duration distributions of the TTS system closer to real durations.
arXiv Detail & Related papers (2023-10-12T08:45:21Z) - Dynamic Spectrum Mixer for Visual Recognition [17.180863898764194]
We propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM).
DSM represents token interactions in the frequency domain by employing the Cosine Transform.
It can learn long-term spatial dependencies with log-linear complexity.
arXiv Detail & Related papers (2023-09-13T04:51:15Z) - Graph-Aware Contrasting for Multivariate Time-Series Classification [50.84488941336865]
Existing contrastive learning methods mainly focus on achieving temporal consistency with temporal augmentation and contrasting techniques.
We propose Graph-Aware Contrasting for spatial consistency across multivariate time-series (MTS) data.
Our proposed method achieves state-of-the-art performance on various MTS classification tasks.
arXiv Detail & Related papers (2023-09-11T02:35:22Z) - Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition [12.980843126905203]
We show that global attention over frequencies is beneficial over local convolution.
We obtain a 2.4% relative word error rate reduction on a production-scale transducer model by replacing its convolutional neural network frontend.
arXiv Detail & Related papers (2023-06-12T08:37:36Z) - Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals [5.743287315640403]
We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables.
Experiments achieved a correlation of 0.675 with ground-truth tract variables.
arXiv Detail & Related papers (2022-03-11T07:27:42Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z) - Multi-Temporal Convolutions for Human Action Recognition in Videos [83.43682368129072]
We present a novel multi-temporal convolution block capable of extracting features at multiple temporal resolutions.
The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture.
arXiv Detail & Related papers (2020-11-08T10:40:26Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z) - Wavelet Networks: Scale-Translation Equivariant Learning From Raw Time-Series [31.73386289965465]
We find that scale-translation equivariant mappings share strong resemblance with the wavelet transform.
Inspired by this resemblance, we term our networks Wavelet Networks, and show that they perform nested non-linear wavelet-like time-frequency transforms.
arXiv Detail & Related papers (2020-06-09T13:50:34Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.