Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement
- URL: http://arxiv.org/abs/2401.04511v1
- Date: Tue, 9 Jan 2024 12:10:04 GMT
- Title: Zero Shot Audio to Audio Emotion Transfer With Speaker Disentanglement
- Authors: Soumya Dutta and Sriram Ganapathy
- Abstract summary: We propose an efficient approach, termed as Zero-shot Emotion Style Transfer (ZEST)
The proposed system builds upon decomposing speech into semantic tokens, speaker representations and emotion embeddings.
We show that the proposed ZEST model achieves zero-shot emotion transfer even without using parallel training data or labels from the source or target audio.
- Score: 41.837538440839815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The problem of audio-to-audio (A2A) style transfer involves replacing the
style features of the source audio with those from the target audio while
preserving the content-related attributes of the source audio. In this paper,
we propose an efficient approach, termed Zero-shot Emotion Style Transfer
(ZEST), that replaces the emotional content of the given source audio with that
embedded in the target audio while retaining the speaker and speech content of
the source. The proposed system builds upon
decomposing speech into semantic tokens, speaker representations and emotion
embeddings. Using these factors, we propose a framework to reconstruct the
pitch contour of the given speech signal and train a decoder that reconstructs
the speech signal. The model is trained using a self-supervision based
reconstruction loss. During conversion, the emotion embedding is alone derived
from the target audio, while rest of the factors are derived from the source
audio. In our experiments, we show that, even without using parallel training
data or labels from the source or target audio, we illustrate zero shot emotion
transfer capabilities of the proposed ZEST model using objective and subjective
quality evaluations.
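To make the conversion flow concrete, the following is a minimal sketch of the factor-swap described in the abstract. The encoder, pitch-predictor and decoder modules are illustrative stand-ins (not the paper's actual architecture, pretrained models or dimensions); the sketch only shows that content and speaker factors are taken from the source audio, the emotion embedding is taken from the target audio, and a decoder reconstructs speech from the combined factors plus a predicted pitch contour.

```python
# Hypothetical sketch of the ZEST-style conversion flow; module names and
# dimensions are placeholders, not the authors' implementation.
import torch
import torch.nn as nn

class StubEncoder(nn.Module):
    """Stand-in for a pretrained factor encoder (content, speaker or emotion)."""
    def __init__(self, out_dim):
        super().__init__()
        # A single strided convolution yields frame-level features from raw audio.
        self.proj = nn.Conv1d(1, out_dim, kernel_size=400, stride=320)

    def forward(self, wav):                 # wav: (batch, samples)
        return self.proj(wav.unsqueeze(1))  # (batch, out_dim, frames)

class PitchPredictor(nn.Module):
    """Predicts a frame-level pitch (F0) contour from the concatenated factors."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Conv1d(in_dim, 1, kernel_size=1)

    def forward(self, feats):
        return self.net(feats)              # (batch, 1, frames)

class Decoder(nn.Module):
    """Reconstructs a waveform from the factors plus the predicted pitch contour."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.ConvTranspose1d(in_dim + 1, 1, kernel_size=400, stride=320)

    def forward(self, feats, pitch):
        return self.net(torch.cat([feats, pitch], dim=1)).squeeze(1)

content_dim, speaker_dim, emotion_dim = 64, 16, 16
content_enc = StubEncoder(content_dim)
speaker_enc = StubEncoder(speaker_dim)
emotion_enc = StubEncoder(emotion_dim)
pitch_pred = PitchPredictor(content_dim + speaker_dim + emotion_dim)
decoder = Decoder(content_dim + speaker_dim + emotion_dim)

def convert(source_wav, target_wav):
    """Zero-shot conversion: only the emotion embedding comes from the target."""
    content = content_enc(source_wav)       # speech content from the source
    speaker = speaker_enc(source_wav)       # speaker identity from the source
    emotion = emotion_enc(target_wav)       # emotion from the target (the swapped factor)
    feats = torch.cat([content, speaker, emotion], dim=1)  # assumes equal frame counts
    pitch = pitch_pred(feats)               # reconstructed pitch contour
    return decoder(feats, pitch)

# During training, source and target are the same utterance, so the model can be
# optimized with a plain reconstruction loss (no parallel data or emotion labels).
src = torch.randn(1, 16000)  # 1 s of source speech at 16 kHz (random placeholder)
tgt = torch.randn(1, 16000)  # 1 s of target (emotion reference) speech
converted = convert(src, tgt)
print(converted.shape)       # roughly the source length, e.g. torch.Size([1, 15760])
```

In a real system the speaker and emotion factors would typically be utterance-level embeddings broadcast across frames; they are kept frame-level here only to keep the sketch short.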
Related papers
- Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z)
- Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion [0.0]
We make the first attempt to reliably restore the source voiceprint from audio synthesized by voice conversion methods.
We develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples.
arXiv Detail & Related papers (2023-02-24T03:33:13Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model improves speech naturalness and content quality when using multiple reference audios.
It also outperforms the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)