Acoustic To Articulatory Speech Inversion Using Multi-Resolution
Spectro-Temporal Representations Of Speech Signals
- URL: http://arxiv.org/abs/2203.05780v1
- Date: Fri, 11 Mar 2022 07:27:42 GMT
- Authors: Rahil Parikh, Nadee Seneviratne, Ganesh Sivaraman, Shihab Shamma,
Carol Espy-Wilson
- Abstract summary: We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables.
Experiments achieved a correlation of 0.675 with ground-truth tract variables.
- Score: 5.743287315640403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-resolution spectro-temporal features of a speech signal represent how the brain perceives sounds by tuning cortical cells to different spectral and temporal modulations. These features produce a higher-dimensional representation of the speech signal. The purpose of this paper is to evaluate how well this auditory-cortex-inspired representation of speech signals contributes to estimating articulatory features of the corresponding signals. Since obtaining articulatory features from the acoustics of speech has long been a challenging problem of interest across speech research communities, we investigate the possibility of using this multi-resolution representation of speech signals as acoustic features. We used the University of Wisconsin X-ray Microbeam (XRMB) database of clean speech to train a feed-forward deep neural network (DNN) to estimate the articulatory trajectories of six tract variables. The optimal set of multi-resolution spectro-temporal features was chosen by tuning the scale and rate vector parameters to obtain the best-performing model. Experiments achieved a correlation of 0.675 with the ground-truth tract variables. We compared the performance of this speech inversion system with prior experiments conducted using Mel Frequency Cepstral Coefficients (MFCCs).
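The abstract outlines a concrete pipeline that can be sketched compactly. In the sketch below, a bank of 2D Gabor-like modulation filters stands in for the cortical scale-rate analysis, and a small feed-forward network maps frame-level features to the six tract variables. The scale/rate values, filter shapes, and network sizes are illustrative guesses, and all names (`modulation_filter`, `spectro_temporal_features`, `InversionDNN`) are hypothetical, not the authors' code.

```python
# Minimal sketch of the pipeline described above, assuming a log spectrogram
# (freq x time) as input. Scales/rates and sizes are illustrative assumptions.
import numpy as np
from scipy.signal import fftconvolve
import torch
import torch.nn as nn

def modulation_filter(scale, rate, n_freq=48, n_time=25,
                      frame_rate=100.0, bins_per_octave=24):
    """2D Gabor-like kernel tuned to one spectral scale (cyc/oct)
    and temporal rate (Hz) pair."""
    f = (np.arange(n_freq) - n_freq // 2) / bins_per_octave    # octaves
    t = (np.arange(n_time) - n_time // 2) / frame_rate         # seconds
    T, F = np.meshgrid(t, f)                                   # (n_freq, n_time)
    envelope = np.exp(-0.5 * ((F * scale) ** 2 + (T * rate) ** 2))
    carrier = np.cos(2.0 * np.pi * (scale * F + rate * T))
    return envelope * carrier

def spectro_temporal_features(log_spec,
                              scales=(0.5, 1.0, 2.0, 4.0),   # cyc/oct (guess)
                              rates=(2.0, 4.0, 8.0, 16.0)):  # Hz (guess)
    """Filter a (freq x time) log spectrogram with every (scale, rate)
    kernel and stack magnitudes into one feature vector per frame."""
    outs = [np.abs(fftconvolve(log_spec, modulation_filter(s, r), mode="same"))
            for s in scales for r in rates]
    return np.concatenate(outs, axis=0).T      # (time, freq * n_filters)

class InversionDNN(nn.Module):
    """Feed-forward DNN mapping frame features to six tract variables."""
    def __init__(self, in_dim, n_tvs=6, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_tvs))
    def forward(self, x):
        return self.net(x)

def mean_tv_correlation(pred, target):
    """Average Pearson correlation across tract variables,
    the evaluation metric quoted in the abstract."""
    return float(np.mean([np.corrcoef(pred[:, i], target[:, i])[0, 1]
                          for i in range(target.shape[1])]))
```

Sweeping `scales` and `rates` and keeping the model that maximizes `mean_tv_correlation` on held-out XRMB data mirrors the scale/rate parameter selection described in the abstract.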
Related papers
- PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
arXiv Detail & Related papers (2023-02-16T05:17:06Z)
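One plausible reading of the PAAPLoss summary, sketched as code: a pre-trained estimator network predicts the otherwise non-differentiable temporal acoustic parameters, and the distance between its predictions on enhanced and clean speech becomes a differentiable training objective. `ParamEstimator`, its architecture, and the parameter count are hypothetical stand-ins, not the paper's model.

```python
# Hedged sketch of an acoustic-parameter loss in the spirit of PAAPLoss:
# a frozen estimator predicts time-series acoustic parameters, and the
# loss compares its outputs on enhanced vs. reference speech.
import torch
import torch.nn as nn

class ParamEstimator(nn.Module):
    """Hypothetical estimator of per-frame acoustic parameters."""
    def __init__(self, n_mels=80, n_params=25):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 128, batch_first=True)
        self.head = nn.Linear(128, n_params)
    def forward(self, mel):                   # (batch, time, n_mels)
        h, _ = self.rnn(mel)
        return self.head(h)                   # (batch, time, n_params)

def acoustic_parameter_loss(estimator, enhanced_mel, clean_mel):
    """L1 distance between predicted parameter trajectories; the
    estimator stays frozen so gradients flow into the enhancer only."""
    for p in estimator.parameters():
        p.requires_grad_(False)
    return nn.functional.l1_loss(estimator(enhanced_mel),
                                 estimator(clean_mel).detach())
```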
- Synthesized Speech Detection Using Convolutional Transformer-Based Spectrogram Analysis [16.93803259128475]
Synthesized speech can be used for nefarious purposes, including creating a purported speech signal and attributing it to someone who did not speak the content of the signal.
In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer for synthesized speech detection.
arXiv Detail & Related papers (2022-05-03T22:05:35Z)
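A hedged sketch of the detection approach as summarized: spectrograms pass through a compact convolutional transformer, i.e. a small convolutional tokenizer, a transformer encoder, and attention-style sequence pooling. Layer sizes and the two-class head are illustrative assumptions.

```python
# Hedged sketch of spectrogram-based synthesized-speech detection with a
# compact-convolutional-transformer-style classifier. Sizes are guesses.
import torch
import torch.nn as nn

class CompactConvTransformer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.tokenizer = nn.Sequential(          # spectrogram -> patch tokens
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.pool = nn.Linear(d_model, 1)        # sequence-pooling weights
        self.head = nn.Linear(d_model, n_classes)
    def forward(self, spec):                     # (batch, 1, freq, time)
        tok = self.tokenizer(spec).flatten(2).transpose(1, 2)  # (b, seq, d)
        enc = self.encoder(tok)
        w = torch.softmax(self.pool(enc), dim=1)               # (b, seq, 1)
        return self.head((w * enc).sum(dim=1))   # bona fide vs. synthesized
```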
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
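The summary gives little architectural detail, but one common way to realize the audio-visual front end, sketched under that assumption, is to concatenate frame-synchronized audio and visual embeddings and estimate a time-frequency mask for the target speaker. `AVMaskEstimator` and all dimensions are hypothetical.

```python
# Hedged sketch of audio-visual mask estimation: fuse audio features with a
# synchronized visual (lip-region) embedding, then predict a T-F mask.
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    def __init__(self, audio_dim=514, visual_dim=256, n_freq=257, hidden=300):
        super().__init__()
        self.blstm = nn.LSTM(audio_dim + visual_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())
    def forward(self, audio_feats, visual_feats):
        # both inputs: (batch, time, dim), already frame-synchronized
        h, _ = self.blstm(torch.cat([audio_feats, visual_feats], dim=-1))
        return self.mask(h)     # time-frequency mask for the target speaker
```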
- EEGminer: Discovering Interpretable Features of Brain Activity with Learnable Filters [72.19032452642728]
We propose a novel differentiable EEG decoding pipeline consisting of learnable filters and a pre-determined feature extraction module.
We demonstrate the utility of our model towards emotion recognition from EEG signals on the SEED dataset and on a new EEG dataset of unprecedented size.
The discovered features align with previous neuroscience studies and offer new insights, such as marked differences in the functional connectivity profile between left and right temporal areas during music listening.
arXiv Detail & Related papers (2021-10-19T14:22:04Z)
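A hedged sketch of the "learnable filters plus pre-determined feature extraction" idea: each filter is a windowed-sinc band-pass whose cutoffs are trainable, followed by a fixed per-channel log-variance feature. Cutoff initialization, kernel length, and sampling rate are illustrative, not EEGminer's exact design.

```python
# Hedged sketch: differentiable band-pass filters with learnable cutoffs,
# followed by a fixed (non-learned) log-variance feature.
import torch
import torch.nn as nn

class LearnableBandpass(nn.Module):
    def __init__(self, n_filters=8, kernel_size=251, fs=128.0):
        super().__init__()
        self.fs, self.k = fs, kernel_size
        self.low = nn.Parameter(torch.linspace(1.0, 30.0, n_filters))  # Hz
        self.band = nn.Parameter(torch.full((n_filters,), 4.0))        # Hz
    def forward(self, x):                        # x: (batch, 1, time)
        t = (torch.arange(self.k) - (self.k - 1) / 2).to(x) / self.fs
        lo = self.low.abs().unsqueeze(1)
        hi = (self.low.abs() + self.band.abs()).unsqueeze(1)
        # difference of two sinc low-pass kernels = band-pass kernel
        kern = 2 * hi * torch.sinc(2 * hi * t) - 2 * lo * torch.sinc(2 * lo * t)
        window = torch.hamming_window(self.k).to(x)
        filtered = nn.functional.conv1d(x, (kern * window).unsqueeze(1),
                                        padding=self.k // 2)
        return torch.log(filtered.var(dim=-1) + 1e-6)   # fixed feature
```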
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
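Dynamic stream weighting has a compact algebraic core: a reliability-driven weight blends the audio and video localization scores per frame (and, in this paper's extension, per spatial region). A minimal sketch, with a hypothetical reliability network and dimensions:

```python
# Hedged sketch of dynamic stream weights: lambda in [0, 1], predicted from
# reliability cues, blends the two streams' localization log-likelihoods.
import torch
import torch.nn as nn

class DynamicStreamWeights(nn.Module):
    def __init__(self, reliability_dim=10):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(reliability_dim, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())
    def forward(self, log_p_audio, log_p_video, reliability):
        # log_p_*: (batch, time, n_regions) localization log-likelihoods
        # reliability: (batch, time, reliability_dim) cues such as SNR
        lam = self.weight_net(reliability)            # (batch, time, 1)
        return lam * log_p_audio + (1.0 - lam) * log_p_video
```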
- Extracting the Locus of Attention at a Cocktail Party from Single-Trial EEG using a Joint CNN-LSTM Model [0.1529342790344802]
The human brain performs remarkably well at segregating a particular speaker from interfering speakers in a multi-speaker scenario.
We present a joint convolutional neural network (CNN) - long short-term memory (LSTM) model to infer the auditory attention.
arXiv Detail & Related papers (2021-02-08T01:06:48Z)
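A minimal sketch of a joint CNN-LSTM attention decoder consistent with this summary: a CNN summarizes short-term EEG patterns, an LSTM integrates them across the trial, and a linear head outputs attended-speaker logits. Channel counts and layer sizes are guesses.

```python
# Hedged sketch of a joint CNN-LSTM auditory-attention decoder for EEG.
import torch
import torch.nn as nn

class CNNLSTMAttentionDecoder(nn.Module):
    def __init__(self, n_channels=64, n_speakers=2, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 32, kernel_size=7, padding=3), nn.ReLU())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_speakers)
    def forward(self, eeg):                  # (batch, n_channels, time)
        h = self.cnn(eeg).transpose(1, 2)    # (batch, time', 32)
        _, (hn, _) = self.lstm(h)
        return self.head(hn[-1])             # attended-speaker logits
```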
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
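The two-module design can be sketched schematically: a bottle-neck feature extractor (BNE) produces speaker-independent features from source speech, and a synthesis module renders them as the target speaker's mel spectrogram given a speaker embedding. For brevity this sketch decodes frame-synchronously rather than with the paper's location-relative seq2seq attention, and every module and dimension is a hypothetical stand-in.

```python
# Hedged sketch of the BNE + synthesis pipeline (frame-aligned simplification).
import torch
import torch.nn as nn

class BNE(nn.Module):
    """Hypothetical bottle-neck feature extractor."""
    def __init__(self, n_mels=80, bottleneck=144):
        super().__init__()
        self.enc = nn.LSTM(n_mels, 256, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, bottleneck)
    def forward(self, mel):                       # (batch, time, n_mels)
        h, _ = self.enc(mel)
        return self.proj(h)                       # linguistic bottleneck

class Synthesizer(nn.Module):
    """Hypothetical synthesis module conditioned on a speaker embedding."""
    def __init__(self, bottleneck=144, spk_dim=256, n_mels=80):
        super().__init__()
        self.dec = nn.LSTM(bottleneck + spk_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)
    def forward(self, bn_feats, spk_embedding):
        # broadcast the target-speaker embedding across time and decode
        spk = spk_embedding.unsqueeze(1).expand(-1, bn_feats.size(1), -1)
        h, _ = self.dec(torch.cat([bn_feats, spk], dim=-1))
        return self.out(h)                        # converted mel spectrogram
```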
- Improving Stability of LS-GANs for Audio and Speech Signals [70.15099665710336]
We show that encoding the departure from normality, computed in this vector space, into the generator optimization formulation helps to craft more comprehensive spectrograms.
We demonstrate that incorporating this metric enhances training stability, with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z)
- Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency [14.062850439230111]
We propose a condition encouraging spectrogram consistency during the adversarial training procedure.
Our experimental results on the Librispeech corpus show that the model trained with the TF consistency condition provides perceptually better speech-to-speech conversion quality.
arXiv Detail & Related papers (2020-05-15T22:27:07Z)
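The time-frequency consistency idea admits a concrete penalty: a generated magnitude spectrogram is consistent only if some waveform actually has that STFT, so one can compare it against the STFT of its own inverse transform. The sketch below uses a zero-phase inversion as an illustrative simplification; the paper's exact condition may differ.

```python
# Hedged sketch of a time-frequency consistency penalty for generated
# magnitude spectrograms: invert, re-analyze, and compare magnitudes.
import torch

def tf_consistency_penalty(mag, n_fft=1024, hop=256):
    # mag: (batch, n_fft // 2 + 1, frames), non-negative magnitudes
    window = torch.hann_window(n_fft, device=mag.device)
    spec = torch.complex(mag, torch.zeros_like(mag))   # zero-phase spectrum
    wav = torch.istft(spec, n_fft, hop_length=hop, window=window)
    remag = torch.stft(wav, n_fft, hop_length=hop, window=window,
                       return_complex=True).abs()
    frames = min(mag.size(-1), remag.size(-1))         # align frame counts
    return torch.mean((mag[..., :frames] - remag[..., :frames]) ** 2)
```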
- Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals [7.219077740523682]
We introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing audio data.
We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes.
arXiv Detail & Related papers (2020-03-06T12:28:04Z)
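A hedged sketch of the multi-time-scale idea: parallel 1D convolutions whose kernels span different temporal extents, concatenated so later layers see tempo-robust features. The published MTS method resamples a shared kernel across scales; dilation is used here as a simpler stand-in.

```python
# Hedged sketch of multi-time-scale convolution via parallel dilated branches.
import torch
import torch.nn as nn

class MultiTimeScaleConv(nn.Module):
    def __init__(self, in_ch=40, out_ch=32, kernel=5, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, kernel, dilation=d,
                      padding=d * (kernel - 1) // 2)   # keep time length
            for d in dilations)
    def forward(self, x):                     # (batch, in_ch, time)
        return torch.cat([b(x) for b in self.branches], dim=1)
```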
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
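Direction-informed separation typically rests on a spatial feature like the one sketched below: the cosine distance between each time-frequency bin's observed inter-channel phase difference (IPD) and the phase difference a source from the target direction would produce; high values flag bins dominated by the target. The two-microphone geometry and free-field delay model are illustrative assumptions, not the paper's exact feature.

```python
# Hedged sketch of a direction-informed spatial feature for target separation.
import numpy as np

def directional_feature(stft_ch0, stft_ch1, mic_dist, angle_rad,
                        fs=16000, n_fft=512, c=343.0):
    """stft_ch*: (freq, time) complex STFTs of two microphones."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)             # (freq,)
    tdoa = mic_dist * np.cos(angle_rad) / c                # target delay (s)
    target_ipd = 2.0 * np.pi * freqs * tdoa                # expected phase
    observed_ipd = np.angle(stft_ch0 * np.conj(stft_ch1))  # (freq, time)
    return np.cos(observed_ipd - target_ipd[:, None])
```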
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.