Related papers: Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

URL: http://arxiv.org/abs/2406.12998v1
Date: Tue, 18 Jun 2024 18:38:17 GMT
Title: Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech
Authors: Cheol Jun Cho, Peter Wu, Tejas S. Prabhune, Dhruv Agarwal, Gopala K. Anumanchipalli,
Abstract summary: Articulatory encodec is a new framework of neural-decoding of speech. It infers articulatory features from speech audio, and synthesizes speech audio from articulatory features. This is the first demonstration of universal, high-performance articulatory inference and synthesis.
Score: 5.0751585360524425
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- articulatory encodec. The articulatory encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-perserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.

Related papers

NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations [7.55995559331834]
Paralinguistic vocalizations are integral to natural spoken communication.<n> NVSpeech bridges the recognition and synthesis of paralinguistic vocalizations.<n> NVSpeech offers the first open, large-scale, word-level annotated pipeline for expressive speech modeling in Mandarin.
arXiv Detail & Related papers (2025-08-06T08:25:26Z)
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2. Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens. We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z)
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems. We introduce spoken language understanding modules to extract speaker-related semantic information. We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations. We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR. Our approach leverages audio-visual speech cues to generate the codes of a neural speech, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. Experiments show A$3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z)
Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention [0.0]
We propose a variant of attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech. Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention. We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances.
arXiv Detail & Related papers (2022-01-25T15:06:07Z)
NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)
Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance. A studio-quality corpus with manual transcription is necessary to train such seq2seq systems. We propose an approach to build high-quality and stable seq2seq based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.