When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds
- URL: http://arxiv.org/abs/2505.24336v1
- Date: Fri, 30 May 2025 08:24:41 GMT
- Title: When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds
- Authors: Minsu Kang, Seolhee Lee, Choonghyeon Lee, Namhyun Cho
- Abstract summary: Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. We introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results showed that the proposed method outperformed baselines in quality, naturalness, and similarity MOS.
- Score: 2.0999222360659613
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human to non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies focused on dog sounds and 16 kHz or 22.05 kHz audio transformation, this work addresses a broader range of non-speech sounds, including natural sounds (lion roars, birdsongs) and designed voices (synthetic growls). To accommodate the generation of diverse non-speech sounds and 44.1 kHz high-quality audio transformation, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results showed that the proposed method outperformed baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres. Demo samples are available at https://nc-ai.github.io/speech/publications/nonhuman-vc/
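As a rough illustration of the 44.1 kHz preprocessing pipeline and CVAE-based conversion model described in the abstract, the sketch below loads audio at 44.1 kHz, extracts a log-mel spectrogram, and runs a toy conditional VAE whose decoder is conditioned on a target timbre label. The STFT/mel settings, the `TimbreCVAE` class, and the `timbre_id` label are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a 44.1 kHz mel front end and a conditional VAE (CVAE)
# skeleton for human-to-non-human voice conversion. All hyperparameters,
# class names, and the `timbre_id` conditioning label are illustrative
# assumptions; this is not the paper's released code.
import librosa
import numpy as np
import torch
import torch.nn as nn

SAMPLE_RATE = 44_100                   # high-quality audio, as targeted in the paper
N_FFT, HOP, N_MELS = 2048, 512, 128    # assumed STFT/mel settings

def preprocess(path: str) -> torch.Tensor:
    """Load audio at 44.1 kHz and convert it to a log-mel spectrogram."""
    wav, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))
    return torch.from_numpy(log_mel).float().unsqueeze(0)  # (1, n_mels, frames)

class TimbreCVAE(nn.Module):
    """Toy CVAE: encode source mel frames, decode conditioned on a target timbre id."""
    def __init__(self, n_mels: int = N_MELS, latent: int = 64, n_timbres: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent))
        self.timbre_emb = nn.Embedding(n_timbres, 32)
        self.decoder = nn.Sequential(nn.Linear(latent + 32, 256), nn.ReLU(),
                                     nn.Linear(256, n_mels))

    def forward(self, mel: torch.Tensor, timbre_id: torch.Tensor):
        # Treat each frame independently for simplicity: (1, n_mels, T) -> (T, n_mels)
        frames = mel.squeeze(0).transpose(0, 1)
        mu, logvar = self.encoder(frames).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        cond = self.timbre_emb(timbre_id).expand(frames.size(0), -1)
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

if __name__ == "__main__":
    mel = preprocess("human_speech.wav")             # placeholder path
    model = TimbreCVAE()
    recon, mu, logvar = model(mel, torch.tensor(3))  # 3 = e.g. a "lion roar" class
    print(recon.shape)                               # (frames, n_mels)
```

A full system would also need a neural vocoder or waveform decoder to turn the predicted frames back into 44.1 kHz audio; that stage is omitted here for brevity.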
Related papers
- AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers [83.90298286498306]
Existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. We propose AudCast, a general audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm. Our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details.
arXiv Detail & Related papers (2025-03-25T16:38:23Z)
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions [101.31009576033776]
AV-Flow is an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions and head pose.
arXiv Detail & Related papers (2025-02-18T18:56:18Z)
- Sketching With Your Voice: "Non-Phonorealistic" Rendering of Sounds via Vocal Imitation [44.50441058435848]
We present a method for producing human-like vocal imitations of sounds.
We first try generating vocal imitations by tuning the control parameters of a vocal tract model.
We apply a cognitive theory of communication to take into account how human speakers reason strategically about their listeners.
arXiv Detail & Related papers (2024-09-20T13:48:48Z)
- Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices [28.998590651956153]
We look at four non-standard applications: stuttered voice conversion, cross-lingual voice conversion, musical instrument conversion, and text-to-voice conversion.
We find that kNN-VC retains high performance in stuttered and cross-lingual voice conversion.
Results are more mixed for the musical instrument and text-to-voice conversion tasks.
arXiv Detail & Related papers (2023-10-12T08:00:25Z)
- Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale [58.46845567087977]
Voicebox is the most versatile text-guided generative model for speech at scale.
It can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation.
It outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.
arXiv Detail & Related papers (2023-06-23T16:23:24Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors (a toy sketch of this residual quantization step appears after this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Speak Like a Dog: Human to Non-human creature Voice Conversion [19.703397078178]
H2NH-VC aims to convert human speech into non-human creature-like speech.
To clarify the possibilities and characteristics of the "speak like a dog" task, we conducted a comparative experiment.
The converted voices were evaluated using mean opinion scores (dog-likeness, sound quality, and intelligibility) and character error rate (CER).
arXiv Detail & Related papers (2022-06-09T22:10:43Z)
- Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition [13.373579620368046]
We have created a VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs.
Experiments show that the vocal sound recognition performance of a model can be improved by 41.9% by adding the VocalSound dataset to an existing dataset as training material.
arXiv Detail & Related papers (2022-05-06T18:08:18Z)
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z)
- Toward Degradation-Robust Voice Conversion [94.60503904292916]
Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any speaker, even one unseen during training.
In real-world scenarios it is difficult to collect clean utterances of a speaker, and they are usually degraded by noise or reverberation.
We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion.
arXiv Detail & Related papers (2021-10-14T17:00:34Z)
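The NaturalSpeech 2 entry above mentions a neural audio codec with residual vector quantizers. The toy sketch below illustrates only that residual quantization step: each codebook quantizes the residual left by the previous stage, so the reconstruction error shrinks as stages are added. The codebook count, size, and dimensionality are arbitrary assumptions and unrelated to the actual codec.

```python
# Toy residual vector quantization (RVQ): each stage quantizes the residual
# left over by the previous stage. Codebook count, size, and dimension are
# arbitrary assumptions for illustration; this is not NaturalSpeech 2's codec.
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK_SIZE, N_STAGES = 16, 32, 4
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(N_STAGES)]

def rvq_encode(x: np.ndarray):
    """Return the chosen code index per stage and the final quantized vector."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    indices = []
    for cb in codebooks:
        # Pick the codeword closest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]  # the next stage sees only what is left over
    return indices, quantized

x = rng.normal(size=DIM)
codes, x_hat = rvq_encode(x)
print(codes, np.linalg.norm(x - x_hat))  # error shrinks as stages are added
```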