Timbre Transfer with Variational Auto Encoding and Cycle-Consistent
Adversarial Networks
- URL: http://arxiv.org/abs/2109.02096v1
- Date: Sun, 5 Sep 2021 15:06:53 GMT
- Title: Timbre Transfer with Variational Auto Encoding and Cycle-Consistent
Adversarial Networks
- Authors: Russell Sammut Bonnici, Charalampos Saitis, Martin Benning
- Abstract summary: This research project investigates the application of deep learning to timbre transfer, where the timbre of a source audio can be converted to the timbre of a target audio with minimal loss in quality.
The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This research project investigates the application of deep learning to timbre
transfer, where the timbre of a source audio can be converted to the timbre of
a target audio with minimal loss in quality. The adopted approach combines
Variational Autoencoders with Generative Adversarial Networks to construct
meaningful representations of the source audio and produce realistic
generations of the target audio. The approach is applied to the Flickr 8k Audio
dataset for transferring vocal timbre between speakers and to the URMP dataset
for transferring musical timbre between instruments. Furthermore, variations of
the adopted approach are trained, and generalised performance is compared using
the metrics SSIM (Structural Similarity Index) and FAD (Fréchet Audio
Distance). It was found that a many-to-many approach outperforms a one-to-one
approach in terms of reconstructive capabilities, and that the adoption of a
basic over a bottleneck residual block design is more suitable for enriching
content information in the latent space. It was also found that whether the
cyclic loss follows a variational autoencoder or a vanilla autoencoder
formulation does not have a significant impact on the reconstructive and
adversarial translation aspects of the model.
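
To make the combination of objectives concrete, here is a minimal PyTorch sketch of a cycle-consistent VAE-GAN training loss for translating domain A (source timbre) to domain B (target timbre). The encoder/decoder/discriminator callables and the loss weights are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: enc_* return (mu, logvar); dec_* map latents to audio
# features; disc_b scores domain-B realism. Weights are illustrative.
import torch
import torch.nn.functional as F

def vae_terms(encoder, decoder, x):
    """Encode x, reparameterise, decode; return reconstruction and KL."""
    mu, logvar = encoder(x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    x_hat = decoder(z)
    recon = F.l1_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon, kl

def cycle_vae_gan_loss(enc_a, dec_a, enc_b, dec_b, disc_b, x_a,
                       w_kl=0.01, w_cyc=10.0):
    # Within-domain VAE reconstruction of the source audio.
    recon_a, kl_a = vae_terms(enc_a, dec_a, x_a)
    # Translate A -> B and score it with the domain-B discriminator.
    mu_a, _ = enc_a(x_a)
    fake_b = dec_b(mu_a)
    logits = disc_b(fake_b)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    # Cycle: map the translation back to A and compare with the source.
    mu_b, _ = enc_b(fake_b)
    cyc = F.l1_loss(dec_a(mu_b), x_a)
    return recon_a + w_kl * kl_a + adv + w_cyc * cyc
```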
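The comparison of block designs refers to the two standard residual variants; a sketch of both, assuming 2-D convolutions over spectrogram inputs with illustrative channel counts:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two full-width 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand: cheaper, but narrower inside."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        mid = ch // reduction
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(),
            nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))
```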
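The two reported metrics can be sketched as follows, assuming reference and generated spectrograms (for SSIM) and clip-level embeddings such as VGGish features (for FAD) are already available; this follows the standard definitions rather than the paper's exact evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm
from skimage.metrics import structural_similarity

def ssim_spectrograms(s_ref, s_gen):
    """SSIM between two magnitude spectrograms of equal shape."""
    rng = max(s_ref.max() - s_ref.min(), 1e-8)
    return structural_similarity(s_ref, s_gen, data_range=rng)

def frechet_audio_distance(emb_ref, emb_gen):
    """Frechet distance between Gaussians fitted to two embedding sets
    (rows = clips, columns = embedding dimensions)."""
    mu_r, mu_g = emb_ref.mean(0), emb_gen.mean(0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise -> take the real part
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```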
Related papers
- AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual
Voice Conversion [2.3443118032034396]
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing.
Our model outperforms existing state-of-the-art results in both subjective and objective evaluations.
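As a rough illustration of the mel-spectrogram pre-processing step mentioned above, a librosa sketch with typical parameter values (not necessarily those used by AutoCycle-VC):

```python
import librosa
import numpy as np

def wav_to_logmel(path, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Load a waveform and return a log-mel spectrogram (n_mels, frames)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```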
arXiv Detail & Related papers (2023-10-10T11:50:16Z)
- Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation [22.28510611697998]
We propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task.
Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features.
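A generic sketch of how audio-conditioned queries can attend over visual features in such a multimodal transformer; the module, dimensions, and pooling here are assumptions rather than AuTR's actual architecture:

```python
import torch.nn as nn

class AudioQueryFusion(nn.Module):
    """Let audio-conditioned object queries attend over visual features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, audio_feat, visual_feat):
        # Make the queries audio-aware, then read out visual features.
        q = queries + audio_feat.mean(dim=1, keepdim=True)
        fused, _ = self.attn(q, visual_feat, visual_feat)
        return self.norm(q + fused)
```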
arXiv Detail & Related papers (2023-07-25T03:59:04Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
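The co-prediction idea can be sketched as two small predictor heads, each regressing the other modality's (detached) embedding; this is a common self-supervised pattern, not AVPC's exact formulation:

```python
import torch.nn.functional as F

def co_prediction_loss(pred_a2v, pred_v2a, feat_a, feat_v):
    """pred_*: small predictor heads; feat_*: audio/visual embeddings
    of the same sound source. Each branch predicts the other's target."""
    loss_av = F.mse_loss(pred_a2v(feat_a), feat_v.detach())
    loss_va = F.mse_loss(pred_v2a(feat_v), feat_a.detach())
    return 0.5 * (loss_av + loss_va)
```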
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- AudioSlots: A slot-centric generative model for audio separation [26.51135156983783]
We present AudioSlots, a slot-centric generative model for blind source separation in the audio domain.
We train the model in an end-to-end manner using a permutation-equivariant loss function.
Our results on Libri2Mix speech separation constitute a proof of concept that this approach shows promise.
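One common way to make a set-structured separation loss insensitive to slot ordering is Hungarian matching between predicted slots and reference sources, sketched below; this illustrates the general mechanism and is not necessarily the paper's exact loss:

```python
import torch
from scipy.optimize import linear_sum_assignment

def matched_separation_loss(slot_specs, ref_specs):
    """slot_specs, ref_specs: (n_sources, freq, time) tensors."""
    n_slots, n_refs = slot_specs.shape[0], ref_specs.shape[0]
    # Pairwise L1 cost between every slot and every reference source.
    cost = torch.stack([
        torch.stack([(slot_specs[i] - ref_specs[j]).abs().mean()
                     for j in range(n_refs)])
        for i in range(n_slots)])
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return cost[torch.as_tensor(rows), torch.as_tensor(cols)].sum()
```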
arXiv Detail & Related papers (2023-05-09T16:28:07Z)
- End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
- Robust Semantic Communications with Masked VQ-VAE Enabled Codebook [56.63571713657059]
We propose a framework for robust end-to-end semantic communication systems to combat semantic noise.
To this end, adversarial training with weight perturbation is developed to incorporate samples with semantic noise into the training dataset.
We develop a feature importance module (FIM) to suppress the noise-related and task-unrelated features.
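A minimal sketch of a feature-importance-style gate that learns to down-weight noise-related feature dimensions; this illustrates the general mechanism only, not the paper's exact FIM:

```python
import torch.nn as nn

class FeatureImportanceGate(nn.Module):
    """Score each feature dimension in [0, 1] and rescale the features,
    suppressing channels the task finds unimportant or noise-related."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats):            # feats: (batch, tokens, dim)
        return feats * self.score(feats)
```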
arXiv Detail & Related papers (2022-06-08T16:58:47Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e. speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e. voice naturalness and similarity, and the model remains robust even with noisy source/target utterances.
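The conversion step described above can be sketched as follows, with placeholder names: the source content embeddings are kept, and an arbitrary (possibly unseen) target-speaker embedding is concatenated in before decoding.

```python
import torch

def convert(decoder, content_embs, target_spk_emb):
    """content_embs: (frames, d_c) from the source utterance;
    target_spk_emb: (d_s,) from any, possibly unseen, speaker."""
    frames = content_embs.shape[0]
    spk = target_spk_emb.unsqueeze(0).expand(frames, -1)
    # Decode with source content but the target speaker's identity.
    return decoder(torch.cat([spk, content_embs], dim=-1))
```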
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
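One straightforward form of such joint modelling is a shared audio encoder with two task heads and a weighted sum of the per-task losses; a generic multi-task sketch, not the paper's specific proposals:

```python
import torch.nn as nn

class JointASRAAC(nn.Module):
    """Shared encoder; each head returns its own training loss."""
    def __init__(self, encoder, asr_head, aac_head, alpha=0.5):
        super().__init__()
        self.encoder, self.asr_head, self.aac_head = encoder, asr_head, aac_head
        self.alpha = alpha

    def forward(self, audio, asr_targets, aac_targets):
        h = self.encoder(audio)
        loss_asr = self.asr_head(h, asr_targets)   # e.g. CTC loss
        loss_aac = self.aac_head(h, aac_targets)   # e.g. cross-entropy
        return self.alpha * loss_asr + (1 - self.alpha) * loss_aac
```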
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
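Capturing temporal dependencies with self-attention can be sketched with a standard transformer encoder over per-frame features; the layer sizes here are assumptions, not PILOT's configuration:

```python
import torch.nn as nn

# Input: (batch, time_frames, 256) features extracted from the
# multi-channel audio; self-attention relates all time frames to
# each other when predicting per-frame source locations.
temporal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4)
```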
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
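The two-stage pipeline reads as below in schematic Python, with placeholder callables standing in for the BNE and the seq2seq synthesis module:

```python
def any_to_many_convert(bne, synthesizer, source_audio, target_speaker_id):
    """bne strips speaker identity into bottleneck content features;
    the seq2seq synthesizer re-renders them with the target timbre."""
    bottleneck = bne(source_audio)
    return synthesizer(bottleneck, target_speaker_id)
```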
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
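The two conditioning signals named above (f0 and loudness) can be extracted with librosa as sketched below; parameter values are typical defaults, not the paper's:

```python
import librosa
import numpy as np

def conditioning_signals(y, sr=16000, hop=256):
    """Frame-wise fundamental frequency and RMS loudness of waveform y."""
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'),
                            sr=sr, hop_length=hop)
    loudness = librosa.feature.rms(y=y, hop_length=hop)[0]
    return np.nan_to_num(f0), loudness   # unvoiced frames -> 0 Hz
```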
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.