Face-to-Music Translation Using a Distance-Preserving Generative
Adversarial Network with an Auxiliary Discriminator
- URL: http://arxiv.org/abs/2006.13469v1
- Date: Wed, 24 Jun 2020 04:17:40 GMT
- Title: Face-to-Music Translation Using a Distance-Preserving Generative
Adversarial Network with an Auxiliary Discriminator
- Authors: Chelhwon Kim, Andrew Port, Mitesh Patel
- Abstract summary: We propose a distance-preserving generative adversarial model to translate images of human faces into an audio domain.
The audio domain is defined by a collection of musical note sounds recorded from 10 different instrument families.
To enforce distance preservation, a loss term is used that penalizes the difference between the pairwise distances of the faces and those of the translated audio samples.
- Score: 5.478764356647437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning a mapping between two unrelated domains, such as image and
audio, without any supervision is a challenging task. In this work, we propose a
distance-preserving generative adversarial model to translate images of human
faces into an audio domain. The audio domain is defined by a collection of
musical note sounds recorded from 10 different instrument families (NSynth) and
a distance metric in which instrument-family class information is incorporated
together with a mel-frequency cepstral coefficient (MFCC) feature. To enforce
distance preservation, a loss term that penalizes the difference between the
pairwise distances of the faces and those of the translated audio samples is
used. Further, we discover that the distance-preservation constraint in the
generative adversarial model leads to reduced diversity in the translated audio
samples, and propose the use of an auxiliary discriminator to enhance the
diversity of the translations while using the distance-preservation constraint.
We also provide a visual demonstration of the results and a numerical analysis
of the fidelity of the translations. A video demo of our proposed model's
learned translation is available at
https://www.dropbox.com/s/the176w9obq8465/face_to_musical_note.mov?dl=0.
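The distance-preservation idea lends itself to a brief sketch. Below is a minimal, illustrative example (not the authors' released code) of the loss described in the abstract: pairwise distances between face features are compared with pairwise distances between the translated audio samples, where the audio-domain distance combines an MFCC term with an instrument-family term. The function names, the normalization, and the `class_weight` hyperparameter are assumptions made for illustration only.

```python
# Minimal sketch of a pairwise distance-preservation loss for face-to-audio
# translation. All names and weights are illustrative assumptions, not values
# from the paper.

import torch
import torch.nn.functional as F


def pairwise_distances(x: torch.Tensor) -> torch.Tensor:
    """Euclidean distance between every pair of rows in a (batch, dim) tensor."""
    return torch.cdist(x, x, p=2)


def audio_domain_distance(mfcc: torch.Tensor, family_logits: torch.Tensor,
                          class_weight: float = 1.0) -> torch.Tensor:
    """Distance in the target audio domain: MFCC distance plus a term that grows
    when two samples are assigned different instrument families.
    `class_weight` is an assumed hyperparameter."""
    mfcc_dist = torch.cdist(mfcc.flatten(1), mfcc.flatten(1), p=2)
    fam = family_logits.softmax(dim=-1)
    fam_dist = torch.cdist(fam, fam, p=2)  # proxy for class disagreement
    return mfcc_dist + class_weight * fam_dist


def distance_preservation_loss(face_feats: torch.Tensor,
                               translated_mfcc: torch.Tensor,
                               translated_family_logits: torch.Tensor) -> torch.Tensor:
    """Penalize the difference between the pairwise distances of the input faces
    and the pairwise distances of their translated audio samples."""
    d_face = pairwise_distances(face_feats.flatten(1))
    d_audio = audio_domain_distance(translated_mfcc, translated_family_logits)
    # Normalize each distance matrix so the two domains are on a comparable scale.
    d_face = d_face / (d_face.mean() + 1e-8)
    d_audio = d_audio / (d_audio.mean() + 1e-8)
    return F.l1_loss(d_audio, d_face)


# Schematic generator update in the GAN training loop (weights are assumed):
#   g_loss = adversarial_loss + lambda_dp * distance_preservation_loss(
#       face_feats, mfcc(generator(faces)), family_classifier(generator(faces)))
```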
Related papers
- Establishing degrees of closeness between audio recordings along
different dimensions using large-scale cross-lingual models [4.349838917565205]
We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata.
Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects.
The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines.
arXiv Detail & Related papers (2024-02-08T11:31:23Z) - DenoSent: A Denoising Objective for Self-Supervised Sentence
Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from a different perspective, namely the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
arXiv Detail & Related papers (2024-01-24T17:48:45Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual
Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z) - AudioViewer: Learning to Visualize Sound [12.71759722609666]
We aim to create sound perception for hearing-impaired people, for instance to facilitate feedback for training deaf speech.
Our design is to translate from audio to video by compressing both into a common latent space with shared structure.
arXiv Detail & Related papers (2020-12-22T21:52:45Z) - Audio-visual Speech Separation with Adversarially Disentangled Visual
Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z) - Deep Sensory Substitution: Noninvasively Enabling Biological Neural
Networks to Receive Input from Artificial Neural Networks [5.478764356647437]
This work describes a novel technique for leveraging machine-learned feature embeddings to sonify visual information into a perceptual audio domain.
A generative adversarial network (GAN) is then used to find a distance preserving map from this metric space of feature vectors into the metric space defined by a target audio dataset.
In human subject tests, users were able to accurately classify audio sonifications of faces.
arXiv Detail & Related papers (2020-05-27T11:41:48Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z) - Unsupervised Cross-Modal Audio Representation Learning from Unstructured
Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
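As a side note on the triplet-based entry above (Unsupervised Cross-Modal Audio Representation Learning), the core training signal can be sketched with a standard triplet margin loss; the encoder, the margin value, and the sampling of related/unrelated tracks below are generic assumptions, not details from that paper.

```python
# Generic sketch (assumed, not from the cited paper) of triplet-based
# cross-modal representation learning: pull an audio embedding toward a
# semantically related ("positive") example and push it away from an
# unrelated ("negative") one.

import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)  # margin is an assumed value


def triplet_step(audio_encoder: nn.Module,
                 anchor: torch.Tensor,
                 positive: torch.Tensor,
                 negative: torch.Tensor) -> torch.Tensor:
    """Anchor and positive are related via cross-modal metadata (e.g. shared
    text); the negative comes from an unrelated track."""
    return triplet(audio_encoder(anchor),
                   audio_encoder(positive),
                   audio_encoder(negative))
```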
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.