Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision
- URL: http://arxiv.org/abs/2007.04134v1
- Date: Wed, 8 Jul 2020 14:07:06 GMT
- Title: Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision
- Authors: Abhinav Shukla, Stavros Petridis, Maja Pantic
- Abstract summary: We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
- Score: 63.564385139097624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The intuitive interaction between the audio and visual modalities is valuable
for cross-modal self-supervised learning. This concept has been demonstrated
for generic audiovisual tasks like video action recognition and acoustic scene
classification. However, self-supervision remains under-explored for
audiovisual speech. We propose a method to learn self-supervised speech
representations from the raw audio waveform. We train a raw audio encoder by
combining audio-only self-supervision (by predicting informative audio
attributes) with visual self-supervision (by generating talking faces from
audio). The visual pretext task drives the audio representations to capture
information related to lip movements. This enriches the audio encoder with
visual information and the encoder can be used for evaluation without the
visual modality. Our method attains competitive performance with respect to
existing self-supervised audio features on established isolated word
classification benchmarks, and significantly outperforms other methods at
learning from fewer labels. Notably, our method also outperforms fully
supervised training, thus providing a strong initialization for speech-related
tasks. Our results demonstrate the potential of multimodal self-supervision in
audiovisual speech for learning good audio representations.
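The following is a minimal PyTorch sketch (not the authors' code) of the joint-training idea described in the abstract: a raw-waveform encoder optimized with an audio-only pretext head that regresses informative audio attributes and a visual pretext decoder that generates a talking-face frame from the audio embedding. All module names, layer sizes, the choice of losses, and the toy 32x32 frame resolution are assumptions made for illustration.

```python
# Sketch of joint audiovisual self-supervision for a raw audio encoder.
# Assumed components: RawAudioEncoder, AttributeHead (audio-only pretext),
# FaceDecoder (visual pretext). Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class RawAudioEncoder(nn.Module):
    """1-D conv stack over the raw waveform -> per-clip embedding."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv1d(128, emb_dim, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, wav):                # wav: (B, 1, T) raw samples
        return self.conv(wav).squeeze(-1)  # (B, emb_dim)

class AttributeHead(nn.Module):
    """Audio-only pretext: regress a few informative audio attributes."""
    def __init__(self, emb_dim=256, n_attr=4):
        super().__init__()
        self.fc = nn.Linear(emb_dim, n_attr)

    def forward(self, z):
        return self.fc(z)

class FaceDecoder(nn.Module):
    """Visual pretext: generate a (toy) talking-face frame from audio."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.fc = nn.Linear(emb_dim, 3 * 32 * 32)

    def forward(self, z):
        return self.fc(z).view(-1, 3, 32, 32)

encoder, attr_head, face_dec = RawAudioEncoder(), AttributeHead(), FaceDecoder()
params = (list(encoder.parameters()) + list(attr_head.parameters())
          + list(face_dec.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

# One illustrative training step on random stand-in data.
wav = torch.randn(8, 1, 16000)            # 1 s of 16 kHz audio
attr_target = torch.randn(8, 4)           # precomputed audio attributes
frame_target = torch.rand(8, 3, 32, 32)   # target face frame for the clip

z = encoder(wav)
loss_audio = nn.functional.mse_loss(attr_head(z), attr_target)
loss_visual = nn.functional.l1_loss(face_dec(z), frame_target)
loss = loss_audio + loss_visual           # joint audiovisual objective
opt.zero_grad(); loss.backward(); opt.step()
```

At evaluation time only the encoder would be kept and fine-tuned (or frozen) for a downstream audio task such as isolated word classification, which is how the abstract describes using the learned representations without the visual modality.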
Related papers
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.