Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis
- URL: http://arxiv.org/abs/2012.15184v1
- Date: Wed, 30 Dec 2020 15:09:02 GMT
- Title: Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic
Speech Synthesis
- Authors: Jose A. Gonzalez-Lopez and Miriam Gonzalez-Atienza and Alejandro
Gomez-Alanis and Jose L. Perez-Cordoba and Phil D. Green
- Abstract summary: Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning.
- Score: 59.623780036359655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible
speech from captured movement of the speech articulators. This technique has
numerous applications, such as restoring oral communication to people who
can no longer speak due to illness or injury. Most successful techniques so far
adopt a supervised learning framework, in which time-synchronous
articulatory-and-speech recordings are used to train a supervised machine
learning algorithm that can be used later to map articulator movements to
speech. This, however, prevents the application of A2A techniques in cases
where parallel data is unavailable, e.g., a person has already lost her/his
voice and only articulatory data can be captured. In this work, we propose a
solution to this problem based on the theory of multi-view learning. The
proposed algorithm attempts to find an optimal temporal alignment between pairs
of non-aligned articulatory-and-acoustic sequences with the same phonetic
content by projecting them into a common latent space where both views are
maximally correlated and then applying dynamic time warping. Several variants
of this idea are discussed and explored. We show that the quality of speech
generated in the non-aligned scenario is comparable to that obtained in the
parallel scenario.
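A minimal sketch of the alignment idea described above, for illustration only: the two views are projected into a shared latent space with plain linear CCA (used here as a stand-in for the paper's multi-view model), and the projected sequences are then aligned with dynamic time warping. The feature dimensions, the use of scikit-learn's CCA, and the crude length-normalised pairing used to initialise the fit are assumptions made for the sketch, not details taken from the paper.

import numpy as np
from sklearn.cross_decomposition import CCA

def dtw_path(X, Y):
    # Classic O(n*m) dynamic time warping over Euclidean frame distances.
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path of frame pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy non-parallel data: an articulatory sequence (e.g. EMA trajectories) and an
# acoustic sequence (e.g. MFCCs) sharing phonetic content but differing in length.
rng = np.random.default_rng(0)
artic = rng.standard_normal((120, 12))   # 120 frames x 12 articulatory features
acoust = rng.standard_normal((150, 13))  # 150 frames x 13 acoustic features

# CCA needs paired samples; in the non-parallel setting one would alternate between
# estimating the alignment and refitting the projection. A crude length-normalised
# pairing is used here purely to keep the sketch short.
idx = np.linspace(0, len(acoust) - 1, len(artic)).astype(int)
cca = CCA(n_components=8)
cca.fit(artic, acoust[idx])

# Project both views into the common latent space where they are maximally correlated,
# then align the latent sequences with DTW.
z_artic, z_acoust = cca.transform(artic, acoust)
alignment = dtw_path(z_artic, z_acoust)
print(f"{len(alignment)} aligned frame pairs")

In the full method the alignment and the projection would be refined jointly; this single pass only shows where each component sits.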
Related papers
- Exploring Speech Recognition, Translation, and Understanding with
Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations have proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE [2.771610203951056]
This study examines how articulatory information can be used for discovering speech units in a self-supervised setting.
We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data.
Experiments were conducted on three different corpora in English and French.
arXiv Detail & Related papers (2022-06-17T14:04:24Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- WavThruVec: Latent speech representation as intermediate features for neural speech synthesis [1.1470070927586016]
WavThruVec is a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as intermediate speech representation.
We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.
arXiv Detail & Related papers (2022-03-31T10:21:08Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech [4.384576489684272]
We propose a novel approach to real-time sequence labeling in speech.
Our model treats speech and its own textual representation as two separate modalities or views.
We show significant gains from jointly learning from the two modalities compared to text or audio alone.
arXiv Detail & Related papers (2020-05-02T12:16:14Z)