Sketching With Your Voice: "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation
- URL: http://arxiv.org/abs/2409.13507v1
- Date: Fri, 20 Sep 2024 13:48:48 GMT
- Title: Sketching With Your Voice: "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation
- Authors: Matthew Caren, Kartik Chandra, Joshua B. Tenenbaum, Jonathan Ragan-Kelley, Karima Ma
- Abstract summary: We present a method for producing human-like vocal imitations of sounds.
We first try generating vocal imitations by tuning the model's control parameters.
We apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners.
- Score: 44.50441058435848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for automatically producing human-like vocal imitations of sounds: the equivalent of "sketching," but for auditory rather than visual representation. Starting with a simulated model of the human vocal tract, we first try generating vocal imitations by tuning the model's control parameters to make the synthesized vocalization match the target sound in terms of perceptually-salient auditory features. Then, to better match human intuitions, we apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners. Finally, we show through several experiments and user studies that when we add this type of communicative reasoning to our method, it aligns with human intuitions better than matching auditory features alone does. This observation has broad implications for the study of depiction in computer graphics.
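To make the two stages concrete, here is a minimal, self-contained Python sketch. Everything in it is a stand-in: a toy harmonic tone generator replaces the simulated vocal tract, two crude features replace the perceptually-salient auditory features, and a softmax listener stands in for the cognitive theory of communication. None of the names come from the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

SR = 16000  # sample rate for the toy synthesizer

def synthesize(params, dur=0.5):
    """Toy stand-in for the simulated vocal tract: a harmonic tone whose
    pitch and brightness are set by two control parameters."""
    f0, brightness = params
    t = np.linspace(0.0, dur, int(SR * dur), endpoint=False)
    audio = np.sin(2 * np.pi * f0 * t) + brightness * np.sin(2 * np.pi * 2 * f0 * t)
    return audio / np.max(np.abs(audio))

def perceptual_features(audio):
    """Toy auditory features: loudness (RMS) and zero-crossing rate as a
    crude pitch/brightness proxy; a real system would use richer cues."""
    rms = np.sqrt(np.mean(audio ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2.0
    return np.array([rms, zcr])

def imitate(target_audio, init=(150.0, 0.3)):
    """Stage 1, analysis by synthesis: tune the control parameters until
    the synthesized vocalization matches the target in feature space."""
    target = perceptual_features(target_audio)
    loss = lambda p: float(np.sum((perceptual_features(synthesize(p)) - target) ** 2))
    return minimize(loss, np.array(init), method="Nelder-Mead").x

def listener_accuracy(candidate, targets, intended_idx, beta=50.0):
    """Stage 2, communicative reasoning (simplified): score a candidate
    imitation by how reliably a softmax listener maps it back to the
    intended target among distractors."""
    f = perceptual_features(candidate)
    d = np.array([np.sum((f - perceptual_features(t)) ** 2) for t in targets])
    p = np.exp(-beta * (d - d.min()))  # shift for numerical stability
    return p[intended_idx] / p.sum()

# Imitate the first of two target sounds, then check how recoverable the
# intended referent is from the listener's point of view.
targets = [synthesize((200.0, 0.1)), synthesize((400.0, 0.8))]
best_params = imitate(targets[0])
print(listener_accuracy(synthesize(best_params), targets, intended_idx=0))
```

In this framing, the communicative stage would re-rank or re-optimize candidates by listener accuracy rather than raw feature distance, which is one simple way to cash out "reasoning strategically about listeners."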
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first singing-voice-synthesis (SVS) method that enables control of singer gender, vocal range, and volume through natural language prompts.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controllability and audio quality.
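As a rough illustration of the range-melody decoupled pitch representation mentioned above (a toy construction of our own, not Prompt-Singer's actual scheme): splitting an absolute pitch contour into a register component and a relative contour lets the register vary independently of the tune.

```python
import numpy as np

def decouple_pitch(midi_pitch: np.ndarray):
    """Toy decoupling: split an absolute pitch contour into a range
    component (mean register) and a range-free melody (offsets from it)."""
    vocal_range = midi_pitch.mean()
    melody = midi_pitch - vocal_range
    return vocal_range, melody

def recouple_pitch(vocal_range: float, melody: np.ndarray) -> np.ndarray:
    """Re-target the same melody to a new vocal range, e.g., to follow a
    prompt asking for a higher-voiced singer without changing the tune."""
    return vocal_range + melody

pitch = np.array([60.0, 62.0, 64.0, 62.0, 60.0])  # a short motif in MIDI
rng, melody = decouple_pitch(pitch)
print(recouple_pitch(rng + 12.0, melody))  # same melody, one octave up
```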
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation [9.416401293559112]
We propose a computational model of speech production built around a pre-trained neural articulatory synthesizer that can reproduce complex speech stimuli from a limited set of interpretable articulatory parameters.
Both forward and inverse models are jointly trained in a self-supervised way from raw acoustic-only speech data from different speakers.
The imitation simulations are evaluated objectively and subjectively and show encouraging performance.
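A minimal sketch of a forward/inverse setup of this kind, trained self-supervised on acoustics alone (the names, dimensions, and architecture here are our own assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

N_ACOUSTIC, N_ARTIC = 80, 10  # e.g., mel-spectrogram bins, articulatory dims

# Inverse model: acoustic features -> articulatory parameters.
inverse = nn.Sequential(nn.Linear(N_ACOUSTIC, 128), nn.ReLU(),
                        nn.Linear(128, N_ARTIC))
# Forward model: articulatory parameters -> predicted acoustic features,
# standing in for the pre-trained neural articulatory synthesizer.
forward_model = nn.Sequential(nn.Linear(N_ARTIC, 128), nn.ReLU(),
                              nn.Linear(128, N_ACOUSTIC))

opt = torch.optim.Adam(list(inverse.parameters()) +
                       list(forward_model.parameters()))

def imitation_step(acoustics: torch.Tensor) -> float:
    """One self-supervised step: the only training signal is how well the
    re-synthesized acoustics reproduce the input -- no articulatory labels."""
    opt.zero_grad()
    artic = inverse(acoustics)        # infer articulatory commands
    recon = forward_model(artic)      # "repeat after me" in acoustic space
    loss = nn.functional.mse_loss(recon, acoustics)
    loss.backward()
    opt.step()
    return loss.item()

# A batch of random frames stands in for real speech features.
print(imitation_step(torch.randn(32, N_ACOUSTIC)))
```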
arXiv Detail & Related papers (2022-04-05T15:02:49Z)
- Controlled AutoEncoders to Generate Faces from Voices [30.062970046955577]
We propose a framework that morphs a target face in response to a given voice, such that facial features are implicitly guided by learned voice-face correlations.
We evaluate the framework on the VoxCeleb and VGGFace datasets through human-subject studies and face retrieval.
arXiv Detail & Related papers (2021-07-16T16:04:29Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation, caused by sound reflecting off surfaces and objects in the environment, not only degrades the quality of speech for human perception but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
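A generic fusion sketch of audio-visual dereverberation (our own illustration, not the VIDA architecture): fuse an audio embedding with a visual-scene embedding and predict a mask that suppresses reverberant energy in the input spectrogram.

```python
import torch
import torch.nn as nn

class AudioVisualDereverb(nn.Module):
    """Generic sketch: condition a spectrogram mask on both the observed
    sounds and an embedding of the visual scene."""
    def __init__(self, n_freq=257, img_feat=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(img_feat, 256), nn.ReLU())
        self.mask_head = nn.Sequential(nn.Linear(512, n_freq), nn.Sigmoid())

    def forward(self, spec, img_emb):
        # spec: (batch, time, n_freq) reverberant magnitude spectrogram
        # img_emb: (batch, img_feat) embedding of the visual scene
        a = self.audio_enc(spec)                       # (B, T, 256)
        v = self.visual_enc(img_emb).unsqueeze(1)      # (B, 1, 256)
        v = v.expand(-1, spec.shape[1], -1)            # broadcast over time
        mask = self.mask_head(torch.cat([a, v], -1))   # (B, T, n_freq)
        return mask * spec                             # dereverberated estimate

model = AudioVisualDereverb()
out = model(torch.rand(2, 100, 257), torch.rand(2, 512))
print(out.shape)  # torch.Size([2, 100, 257])
```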
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Predicting Emotions Perceived from Sounds [2.9398911304923447]
Sonification is the science of communicating data and events to users through sound.
This paper presents an experiment in which several mainstream and conventional machine learning algorithms are developed to predict the emotions listeners perceive from sounds.
The results show that perceived emotions can be predicted with high accuracy.
arXiv Detail & Related papers (2020-12-04T15:01:59Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with melody-derived features to drive a waveform-based generator.
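One way to picture that conditioning scheme (a generic sketch under our own assumptions, not the paper's model): speaker-agnostic content features from the ASR-trained acoustic model, concatenated with an extracted melody (F0) contour, drive the generator frame by frame.

```python
import torch
import torch.nn as nn

class WavToWavConverter(nn.Module):
    """Generic sketch: generate waveform frames from ASR-derived content
    features plus a melody (F0) contour from the source singing."""
    def __init__(self, content_dim=256, n_melody=1, hidden=512):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(content_dim + n_melody, hidden), nn.ReLU(),
            nn.Linear(hidden, 160))  # 160 samples per 10 ms frame at 16 kHz

    def forward(self, content, f0):
        # content: (B, T, content_dim) from the ASR acoustic model
        # f0: (B, T, 1) melody contour extracted from the source singing
        frames = self.generator(torch.cat([content, f0], dim=-1))
        return frames.flatten(1)  # (B, T * 160) waveform samples

model = WavToWavConverter()
wav = model(torch.rand(2, 50, 256), torch.rand(2, 50, 1))
print(wav.shape)  # torch.Size([2, 8000]), i.e., 0.5 s at 16 kHz
```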
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation [10.37199090634032]
The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical parametric synthesis.
Second, the artifacts occurring in high-pitched voices are discussed, and possible approaches to overcoming them are suggested.
arXiv Detail & Related papers (2020-06-07T13:06:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.