When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds
- URL: http://arxiv.org/abs/2505.24336v1
- Date: Fri, 30 May 2025 08:24:41 GMT
- Title: When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds
- Authors: Minsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho
- Abstract summary: Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. We introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results showed that the proposed method outperformed baselines in quality, naturalness, and similarity MOS.
- Score: 2.0999222360659613
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies focused on dog sounds and 16 kHz or 22.05 kHz audio transformation, this work addresses a broader range of non-speech sounds, including natural sounds (lion roars, birdsongs) and designed voices (synthetic growls). To accommodate the generation of diverse non-speech sounds and 44.1 kHz high-quality audio transformation, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results showed that the proposed method outperformed baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres. Demo samples are available at https://nc-ai.github.io/speech/publications/nonhuman-vc/
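As a rough illustration of the 44.1 kHz preprocessing pipeline and CVAE-based conversion model described in the abstract, the sketch below loads audio at 44.1 kHz, extracts a log-mel spectrogram, and runs a toy conditional VAE whose decoder is conditioned on a target timbre label. The STFT/mel settings, the `TimbreCVAE` class, and the `timbre_id` label are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a 44.1 kHz mel front end and a conditional VAE (CVAE)
# skeleton for human-to-non-human voice conversion. All hyperparameters,
# class names, and the `timbre_id` conditioning label are illustrative
# assumptions; this is not the paper's released code.
import librosa
import numpy as np
import torch
import torch.nn as nn

SAMPLE_RATE = 44_100                   # high-quality audio, as targeted in the paper
N_FFT, HOP, N_MELS = 2048, 512, 128    # assumed STFT/mel settings

def preprocess(path: str) -> torch.Tensor:
    """Load audio at 44.1 kHz and convert it to a log-mel spectrogram."""
    wav, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))
    return torch.from_numpy(log_mel).float().unsqueeze(0)  # (1, n_mels, frames)

class TimbreCVAE(nn.Module):
    """Toy CVAE: encode source mel frames, decode conditioned on a target timbre id."""
    def __init__(self, n_mels: int = N_MELS, latent: int = 64, n_timbres: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent))
        self.timbre_emb = nn.Embedding(n_timbres, 32)
        self.decoder = nn.Sequential(nn.Linear(latent + 32, 256), nn.ReLU(),
                                     nn.Linear(256, n_mels))

    def forward(self, mel: torch.Tensor, timbre_id: torch.Tensor):
        # Treat each frame independently for simplicity: (1, n_mels, T) -> (T, n_mels)
        frames = mel.squeeze(0).transpose(0, 1)
        mu, logvar = self.encoder(frames).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        cond = self.timbre_emb(timbre_id).expand(frames.size(0), -1)
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

if __name__ == "__main__":
    mel = preprocess("human_speech.wav")             # placeholder path
    model = TimbreCVAE()
    recon, mu, logvar = model(mel, torch.tensor(3))  # 3 = e.g. a "lion roar" class
    print(recon.shape)                               # (frames, n_mels)
```

A full system would also need a neural vocoder or waveform decoder to turn the predicted frames back into 44.1 kHz audio; that stage is omitted here for brevity.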
Related papers
- AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers [83.90298286498306]
Existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. We propose AudCast, a general audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm. Our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details.
arXiv Detail & Related papers (2025-03-25T16:38:23Z)
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions [101.31009576033776]
AV-Flow is an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose.
arXiv Detail & Related papers (2025-02-18T18:56:18Z)
- Sketching With Your Voice: "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation [44.50441058435848]
We present a method for producing human-like vocal imitations of sounds.
We first try generating vocal imitations by tuning the control parameters of a vocal tract model.
We apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners.
arXiv Detail & Related papers (2024-09-20T13:48:48Z)
- Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices [28.998590651956153]
We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion.
We find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion.
Results are more mixed for the musical instrument and text-to-voice conversion tasks.
arXiv Detail & Related papers (2023-10-12T08:00:25Z)
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale [58.46845567087977]
Voicebox is the most versatile text-guided generative model for speech at scale.
It can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation.
It outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.
arXiv Detail & Related papers (2023-06-23T16:23:24Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors (a toy sketch of this residual quantization step appears after this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Speak Like a Dog: Human to Non-human creature Voice Conversion [19.703397078178]
H2NH-VC aims to convert human speech into non-human creature-like speech.
To clarify the possibilities and characteristics of the "speak like a dog" task, we conducted a comparative experiment.
The converted voices were evaluated using mean opinion scores (dog-likeness, sound quality, and intelligibility) and character error rate (CER).
arXiv Detail & Related papers (2022-06-09T22:10:43Z)
- Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition [13.373579620368046]
We have created a VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs.
Experiments show that the vocal sound recognition performance of a model can be improved by 41.9% by adding the VocalSound dataset to an existing dataset as training material.
arXiv Detail & Related papers (2022-05-06T18:08:18Z)
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z)
- Toward Degradation-Robust Voice Conversion [94.60503904292916]
Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any speaker, even one unseen during training.
In real-world scenarios it is difficult to collect clean utterances of a speaker, and they are usually degraded by noise or reverberation.
We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion.
arXiv Detail & Related papers (2021-10-14T17:00:34Z)
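The NaturalSpeech 2 entry above mentions a neural audio codec with residual vector quantizers. The toy sketch below illustrates only that residual quantization step: each codebook quantizes the residual left by the previous stage, so the reconstruction error shrinks as stages are added. The codebook count, size, and dimensionality are arbitrary assumptions and unrelated to the actual codec.

```python
# Toy residual vector quantization (RVQ): each stage quantizes the residual
# left over by the previous stage. Codebook count, size, and dimension are
# arbitrary assumptions for illustration; this is not NaturalSpeech 2's codec.
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, N_STAGES = 16, 32, 4
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(N_STAGES)]

def rvq_encode(x: np.ndarray):
    """Return the chosen code index per stage and the final quantized vector."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    indices = []
    for cb in codebooks:
        # Pick the codeword closest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]  # the next stage sees only what is left over
    return indices, quantized

x = rng.normal(size=DIM)
codes, x_hat = rvq_encode(x)
print(codes, np.linalg.norm(x - x_hat))  # error shrinks as stages are added
```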