AudioViewer: Learning to Visualize Sound
- URL: http://arxiv.org/abs/2012.13341v3
- Date: Thu, 11 Mar 2021 19:51:23 GMT
- Title: AudioViewer: Learning to Visualize Sound
- Authors: Yuchi Zhang, Willis Peng, Bastian Wandt and Helge Rhodin
- Abstract summary: We aim to create sound perception for hearing impaired people, for instance, to facilitate feedback for training deaf speech.
Our design is to translate from audio to video by compressing both into a common latent space with shared structure.
- Score: 12.71759722609666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sensory substitution can help persons with perceptual deficits. In this work,
we attempt to visualize audio with video. Our long-term goal is to create sound
perception for hearing impaired people, for instance, to facilitate feedback
for training deaf speech. Different from existing models that translate between
speech and text or text and images, we target an immediate and low-level
translation that applies to generic environment sounds and human speech without
delay. No canonical mapping is known for this artificial translation task. Our
design is to translate from audio to video by compressing both into a common
latent space with shared structure. Our core contribution is the development
and evaluation of learned mappings that respect human perception limits and
maximize user comfort by enforcing priors and combining strategies from
unpaired image translation and disentanglement. We demonstrate qualitatively
and quantitatively that our AudioViewer model maintains important audio
features in the generated video and that generated videos of faces and numbers
are well suited for visualizing high-dimensional audio features since they can
easily be parsed by humans to match and distinguish between sounds, words, and
speakers.
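The core design described in the abstract, translating audio to video by compressing both modalities into a common latent space, can be made concrete with a small encoder/decoder pair. The sketch below is not the authors' code: the class names, the 64-dimensional latent space, and the layer sizes are illustrative assumptions; it only shows the data flow from a spectrogram window through a shared latent code to a decoded image. The paper additionally enforces perceptual priors and disentanglement constraints during training, which this sketch omits.

```python
# Minimal sketch (illustrative, not the authors' implementation) of the
# audio -> shared latent -> video idea described in the abstract.
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed size of the shared latent space


class AudioEncoder(nn.Module):
    """Compresses a (mel-)spectrogram window into a shared latent code."""
    def __init__(self, n_mels=80, n_frames=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * n_frames, 512), nn.ReLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, spec):               # spec: (B, n_mels, n_frames)
        return self.net(spec)


class ImageDecoder(nn.Module):
    """Decodes a shared latent code into a small grayscale face/digit image."""
    def __init__(self, img_size=32):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 512), nn.ReLU(),
            nn.Linear(512, img_size * img_size), nn.Sigmoid(),
        )

    def forward(self, z):                  # z: (B, LATENT_DIM)
        img = self.net(z)
        return img.view(-1, 1, self.img_size, self.img_size)


# Usage: a batch of spectrogram windows becomes a batch of video frames.
encoder, decoder = AudioEncoder(), ImageDecoder()
spec = torch.randn(4, 80, 16)              # dummy spectrogram windows
frames = decoder(encoder(spec))            # (4, 1, 32, 32) images to display
```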
Related papers
- Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture and then synthesizes the waveform with a neural vocoder (see the mel-spectrogram sketch after this list).
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
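As referenced in the LA-VocE entry above, several of these pipelines operate on (log-)mel-spectrogram features rather than raw waveforms. The snippet below is a minimal, assumed front-end using librosa; the input file name and the FFT, hop, and mel-band parameters are illustrative choices, not values taken from any of the papers.

```python
# Minimal, assumed mel-spectrogram front-end using librosa.
import librosa
import numpy as np

# Hypothetical input clip; 16 kHz is a common rate for speech models.
audio, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)   # (80, n_frames), in dB
print(log_mel.shape)
```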
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.