Exploration strategies for articulatory synthesis of complex syllable onsets
- URL: http://arxiv.org/abs/2204.09381v1
- Date: Wed, 20 Apr 2022 10:47:28 GMT
- Title: Exploration strategies for articulatory synthesis of complex syllable onsets
- Authors: Daniel R. van Niekerk, Anqi Xu, Branislav Gerazov, Paul K. Krug, Peter Birkholz, Yi Xu
- Abstract summary: High-quality articulatory speech synthesis has many potential applications in speech science and technology.
We construct an optimisation-based framework as a first step towards learning these mappings without manual intervention.
- Score: 20.422871314256266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time-consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings without manual intervention. We demonstrate the production of syllables with complex onsets and discuss the quality of the articulatory gestures with reference to coarticulation.
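To make the optimisation-based framework concrete, here is a minimal, self-contained sketch of the loop the abstract describes: propose articulatory gesture parameters, render them to audio, score the result against a reference utterance, and keep improvements. The synthesiser and acoustic distance below are toy stand-ins added for illustration, not the paper's actual components, and the simple (1+lambda) evolution strategy is an assumed optimiser, not necessarily the one the authors used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's components): a real setup
# would call an articulatory synthesiser and compare spectral features.
_MIX = rng.standard_normal((64, 10))

def synthesise(gestures: np.ndarray) -> np.ndarray:
    """Map 10 gesture parameters to a 64-sample 'waveform' (toy model)."""
    return np.tanh(_MIX @ gestures)

def acoustic_distance(audio: np.ndarray, reference: np.ndarray) -> float:
    """Squared L2 distance as a stand-in for a spectral/perceptual measure."""
    return float(np.sum((audio - reference) ** 2))

def optimise_gestures(reference, dim=10, iters=300, popsize=8, sigma=0.2):
    """(1+lambda) evolution strategy over gesture parameters."""
    best = rng.uniform(-1.0, 1.0, dim)                # random initial gestures
    best_cost = acoustic_distance(synthesise(best), reference)
    for _ in range(iters):
        # Propose perturbed candidates around the current best gestures.
        cands = best + sigma * rng.standard_normal((popsize, dim))
        costs = [acoustic_distance(synthesise(c), reference) for c in cands]
        i = int(np.argmin(costs))
        if costs[i] < best_cost:                      # keep only improvements
            best, best_cost = cands[i], costs[i]
    return best, best_cost

# Pretend reference utterance: audio produced by some unknown gesture set.
target = synthesise(rng.uniform(-1.0, 1.0, 10))
params, cost = optimise_gestures(target)
print(f"final acoustic distance: {cost:.4f}")
```

In the real setting the reference would be a recorded syllable and the distance a spectral or perceptual measure, so that the loop discovers gestures whose acoustics match the target without hand-crafted linguistic-to-articulatory mappings.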
Related papers
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- Expressivity and Speech Synthesis [51.75420054449122]
We outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity.
We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology.
arXiv Detail & Related papers (2024-04-30T08:47:24Z)
- Unified speech and gesture synthesis using flow matching [24.2094371314481]
This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text.
The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures.
arXiv Detail & Related papers (2023-10-08T14:37:28Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
- Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings [27.352570417976153]
We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics.
For the rhythm, our system contains a robust rhythm-based segmentation pipeline to explicitly ensure temporal coherence between vocalization and gestures.
For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory.
arXiv Detail & Related papers (2022-10-04T08:19:06Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis [37.93814851450597]
We propose a hierarchical framework to model speaking style from context.
A hierarchical context encoder is proposed to explore a wider range of contextual information.
To encourage this encoder to learn style representation better, we introduce a novel training strategy.
arXiv Detail & Related papers (2022-03-23T05:27:57Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
- Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis [59.623780036359655]
Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators.
This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury.
We propose a solution to this problem based on the theory of multi-view learning (a basic alignment sketch follows this list).
arXiv Detail & Related papers (2020-12-30T15:09:02Z)
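As flagged in the Multi-view Temporal Alignment entry above, the basic obstacle in non-parallel A2A synthesis is that articulatory and acoustic recordings are not time-synchronised. The sketch below is plain dynamic time warping over two feature sequences, a generic baseline for this kind of alignment; it is an illustrative assumption, not the paper's multi-view method, which learns a shared space in which to compare the two views.

```python
import numpy as np

def dtw_align(X: np.ndarray, Y: np.ndarray):
    """Align feature sequences X (n, d) and Y (m, d) by dynamic time warping.

    Returns the cumulative alignment cost and the warping path as
    (i, j) index pairs. Plain DTW, not the paper's multi-view method.
    """
    n, m = len(X), len(Y)
    # Pairwise Euclidean distances between all frame pairs.
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack the optimal warping path from the end of both sequences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return acc[n, m], path[::-1]

# Example: align a short articulatory stream to a longer acoustic one
# (feature names and dimensions are hypothetical).
rng = np.random.default_rng(1)
artic = rng.standard_normal((20, 12))    # e.g. EMA articulator trajectories
acoust = rng.standard_normal((30, 12))   # e.g. MFCC frames
cost, path = dtw_align(artic, acoust)
print(f"DTW cost {cost:.2f}, path length {len(path)}")
```

Note that plain DTW assumes both sequences live in a comparable feature space, which is exactly what non-parallel articulatory and acoustic data lack; closing that gap is what the multi-view learning step is for.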
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.