EMA2S: An End-to-End Multimodal Articulatory-to-Speech System
- URL: http://arxiv.org/abs/2102.03786v1
- Date: Sun, 7 Feb 2021 12:14:14 GMT
- Title: EMA2S: An End-to-End Multimodal Articulatory-to-Speech System
- Authors: Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman,
Wen-Chin Huang, Xugang Lu, Yu Tsao
- Abstract summary: We present EMA2S, an end-to-end multimodal articulatory-to-speech system.
We use a neural-network-based vocoder combined with multimodal joint-training, incorporating spectrogram, mel-spectrogram, and deep features.
- Score: 26.491629363635454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesized speech from articulatory movements can have real-world use for
patients with vocal cord disorders, in situations requiring silent speech, or in
high-noise environments. In this work, we present EMA2S, an end-to-end
multimodal articulatory-to-speech system that directly converts articulatory
movements to speech signals. We use a neural-network-based vocoder combined
with multimodal joint-training, incorporating spectrogram, mel-spectrogram, and
deep features. The experimental results confirm that the multimodal approach of
EMA2S outperforms the baseline system in terms of both objective evaluation and
subjective evaluation metrics. Moreover, results demonstrate that joint
mel-spectrogram and deep feature loss training can effectively improve system
performance.
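The abstract does not spell out the training objective in detail, but the joint spectrogram, mel-spectrogram, and deep-feature loss it describes can be illustrated with a short sketch. The following PyTorch code is a minimal illustration, not the authors' implementation: the network architecture, the feature extractor, and the loss weights (`w_spec`, `w_mel`, `w_deep`) are all assumptions made for the example.

```python
# Minimal sketch of multimodal joint-loss training for an articulatory-to-speech
# model. Module shapes, the deep-feature extractor, and the loss weights are
# illustrative assumptions; they are not specified in the abstract.
import torch
import torch.nn as nn

class EMAToSpeech(nn.Module):
    """Hypothetical EMA-to-spectrogram network (stand-in for EMA2S)."""
    def __init__(self, ema_dim=12, hidden=256, spec_bins=513, mel_bins=80):
        super().__init__()
        self.encoder = nn.LSTM(ema_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.to_spec = nn.Linear(2 * hidden, spec_bins)  # linear-spectrogram head
        self.to_mel = nn.Linear(2 * hidden, mel_bins)    # mel-spectrogram head

    def forward(self, ema):  # ema: (batch, frames, ema_dim)
        h, _ = self.encoder(ema)
        return self.to_spec(h), self.to_mel(h)

def joint_loss(pred_spec, pred_mel, tgt_spec, tgt_mel,
               feature_extractor, w_spec=1.0, w_mel=1.0, w_deep=1.0):
    """Weighted sum of spectrogram, mel-spectrogram, and deep-feature losses."""
    l1 = nn.L1Loss()
    loss_spec = l1(pred_spec, tgt_spec)
    loss_mel = l1(pred_mel, tgt_mel)
    # Deep-feature loss: compare embeddings of predicted vs. target mels
    # produced by a frozen feature network.
    with torch.no_grad():
        tgt_feat = feature_extractor(tgt_mel)
    loss_deep = l1(feature_extractor(pred_mel), tgt_feat)
    return w_spec * loss_spec + w_mel * loss_mel + w_deep * loss_deep

# Usage example (shapes only; a pretrained speech network would serve as the
# feature extractor in practice, and a neural vocoder would turn the predicted
# spectrograms into a waveform).
model = EMAToSpeech()
feature_extractor = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))
ema = torch.randn(4, 200, 12)
tgt_spec, tgt_mel = torch.randn(4, 200, 513), torch.randn(4, 200, 80)
pred_spec, pred_mel = model(ema)
loss = joint_loss(pred_spec, pred_mel, tgt_spec, tgt_mel, feature_extractor)
loss.backward()
```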
Related papers
- Audio-Vision Contrastive Learning for Phonological Class Recognition [6.476789653980653]
We propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions.
Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-07-23T16:44:22Z)
- Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture [11.063156506583562]
We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy.
We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs.
Our multi-microphone model achieves superior performance compared to single-channel baselines when tested on real-world reverberant environments.
arXiv Detail & Related papers (2024-06-05T13:50:59Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference [6.279057784373124]
In this paper, we design a fully multimodal video-to-emotion system (FV2ES) for fast yet effective recognition inference.
Applying a hierarchical attention method to the sound spectra overcomes the limited contribution of the acoustic modality.
Further integrating data pre-processing into the aligned multimodal learning model significantly reduces computational cost and storage space.
arXiv Detail & Related papers (2022-09-21T08:05:26Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The inversion model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
arXiv Detail & Related papers (2020-12-30T15:09:02Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)