StreamVC: Real-Time Low-Latency Voice Conversion
- URL: http://arxiv.org/abs/2401.03078v1
- Date: Fri, 5 Jan 2024 22:37:26 GMT
- Title: StreamVC: Real-Time Low-Latency Voice Conversion
- Authors: Yang Yang, Yury Kartynnik, Yunpeng Li, Jiuqiang Tang, Xing Li, George
Sung, Matthias Grundmann
- Abstract summary: StreamVC is a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech.
StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform.
- Score: 20.164321451712564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present StreamVC, a streaming voice conversion solution that preserves the
content and prosody of any source speech while matching the voice timbre from
any target speech. Unlike previous approaches, StreamVC produces the resulting
waveform at low latency from the input signal even on a mobile platform, making
it applicable to real-time communication scenarios like calls and video
conferencing, and addressing use cases such as voice anonymization in these
scenarios. Our design leverages the architecture and training strategy of the
SoundStream neural audio codec for lightweight high-quality speech synthesis.
We demonstrate the feasibility of learning soft speech units causally, as well
as the effectiveness of supplying whitened fundamental frequency information to
improve pitch stability without leaking the source timbre information.
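A minimal sketch of what whitened F0 conditioning could look like, assuming librosa's pyin tracker (the paper's actual pitch tracker and feature pipeline are not specified here): per-utterance normalization of log-F0 keeps the pitch contour (prosody) while discarding the absolute pitch level that would reveal the source speaker's timbre.

```python
import numpy as np
import librosa  # assumed tracker for this sketch; not the paper's pipeline

def whitened_f0(wav: np.ndarray, sr: int = 16000):
    """Per-utterance whitening of log-F0: keep the contour, drop the level."""
    f0, voiced, _ = librosa.pyin(
        wav, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    log_f0 = np.log(f0)                       # f0 is NaN on unvoiced frames
    mu, sigma = np.nanmean(log_f0), np.nanstd(log_f0) + 1e-8
    z = (log_f0 - mu) / sigma                 # whitened: ~zero mean, unit variance
    # NB: full-utterance moments are not causal; a streaming system would
    # track running estimates of mu/sigma instead.
    return np.nan_to_num(z), voiced.astype(np.float32)
```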
Related papers
- Zero-shot Voice Conversion with Diffusion Transformers [0.0]
Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker.
Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks.
We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training.
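The timbre shifter itself is not described in this summary. As a hedged stand-in, vocal tract length perturbation (VTLP) is one classic way to perturb timbre while preserving content by warping the frequency axis of a spectrogram; the sketch below illustrates that generic technique, not Seed-VC's actual module.

```python
import numpy as np

def vtlp_warp(spec: np.ndarray, alpha: float) -> np.ndarray:
    """Warp the frequency axis of a magnitude spectrogram (freq_bins, frames)
    by factor alpha (~0.9-1.1): a crude, VTLP-style timbre perturbation."""
    n_bins = spec.shape[0]
    warped = np.clip(np.arange(n_bins) / alpha, 0, n_bins - 1)
    lo = np.floor(warped).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = (warped - lo)[:, None]
    return (1 - frac) * spec[lo] + frac * spec[hi]  # linear interpolation

# during training, each source utterance could pass through a random warp:
# spec_shifted = vtlp_warp(spec, alpha=np.random.uniform(0.9, 1.1))
```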
arXiv Detail & Related papers (2024-11-15T04:43:44Z)
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, reduces speech sequences to lengths comparable to those of text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
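GroupFormer's internals are not given in this summary; the sketch below only illustrates the general idea of shortening a speech-token sequence by folding fixed-size groups of adjacent frames into single vectors (module names and the group size are assumptions).

```python
import torch
import torch.nn as nn

class GroupedSpeechEncoder(nn.Module):
    """Illustrative only: merge groups of adjacent speech frames into single
    tokens so the sequence length becomes comparable to a text sequence."""
    def __init__(self, dim: int = 256, group: int = 8):
        super().__init__()
        self.group = group
        self.proj = nn.Linear(dim * group, dim)   # one token per group

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        t = (t // self.group) * self.group         # drop the ragged tail
        x = x[:, :t].reshape(b, t // self.group, d * self.group)
        return self.proj(x)                        # (B, T/group, D)

# e.g. 50 Hz speech frames with group=8 -> ~6 tokens/second, near text rates
```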
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
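As a sketch of the bridging idea, assuming OpenAI's clip package (the paper's exact encoder and diffusion model are not shown): condition on a CLIP image embedding of a video frame at training time, then exploit CLIP's shared embedding space to substitute a text embedding at inference. The file name below is a placeholder.

```python
import torch
import clip                       # OpenAI CLIP, assumed as the vision-language encoder
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# training-time conditioning: embed a video frame
frame = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
with torch.no_grad():
    cond_image = model.encode_image(frame)       # (1, 512) conditioning vector

# inference-time swap: text stands in for the frame in the shared space
with torch.no_grad():
    cond_text = model.encode_text(clip.tokenize(["a dog barking"]).to(device))
# cond_image / cond_text would then condition the diffusion model (not shown)
```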
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automatic speech recognition (ASR) features, pitch tracking, and a state-of-the-art waveform prediction model.
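A hypothetical glue sketch of the pipeline as summarized (all callables are placeholders, not HiFi-VC's actual components): content from ASR features, prosody from a pitch tracker, identity from a target speaker embedding, and a waveform from the vocoder.

```python
import numpy as np

def convert(source_wav, target_emb, asr_encoder, pitch_tracker, vocoder):
    """Any-to-any VC data flow; each callable is a stand-in component."""
    content = asr_encoder(source_wav)        # speaker-independent linguistic features
    f0 = pitch_tracker(source_wav)           # source prosody
    return vocoder(content, f0, target_emb)  # waveform in the target voice
```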
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
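The paper's actual toolchain is not specified in this summary; purely as an illustration of post-training 8-bit quantization, here is PyTorch dynamic quantization applied to a stand-in mask-prediction model (not the VoiceFilter-Lite architecture).

```python
import torch
import torch.nn as nn

class TinySeparator(nn.Module):
    """Stand-in for an on-device mask-prediction model."""
    def __init__(self, n_bins: int = 257, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_bins)

    def forward(self, spec):                   # spec: (B, frames, n_bins)
        h, _ = self.lstm(spec)
        return torch.sigmoid(self.mask(h))     # per-bin soft mask

fp32 = TinySeparator().eval()
# post-training dynamic quantization: weights stored as int8
int8 = torch.ao.quantization.quantize_dynamic(
    fp32, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)
```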
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- Vocoder-Based Speech Synthesis from Silent Videos [28.94460283719776]
We present a way to synthesise speech from the silent video of a talker using deep learning.
The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm.
arXiv Detail & Related papers (2020-04-06T10:22:04Z)