Speech SimCLR: Combining Contrastive and Reconstruction Objective for
Self-supervised Speech Representation Learning
- URL: http://arxiv.org/abs/2010.13991v2
- Date: Mon, 5 Jul 2021 03:13:25 GMT
- Title: Speech SimCLR: Combining Contrastive and Reconstruction Objective for
Self-supervised Speech Representation Learning
- Authors: Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li
- Abstract summary: Speech SimCLR is a new self-supervised objective for speech representation learning.
During training, Speech SimCLR applies augmentation to raw speech and its spectrogram.
- Score: 20.39971017940006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised visual pretraining has shown significant progress recently.
Among those methods, SimCLR greatly advanced the state of the art in
self-supervised and semi-supervised learning on ImageNet. The input feature
representations for speech and visual tasks are both continuous, so it is
natural to consider applying a similar objective to speech representation
learning. In this paper, we propose Speech SimCLR, a new self-supervised
objective for speech representation learning. During training, Speech SimCLR
applies augmentation to raw speech and its spectrogram. Its objective combines
a contrastive loss, which maximizes agreement between differently augmented
samples in the latent space, with a reconstruction loss on the input
representation. The proposed method achieves competitive results on speech
emotion recognition and speech recognition.
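To make the objective concrete, below is a minimal PyTorch-style sketch of the combined loss, assuming two augmented views per utterance; `encoder`, `projection_head`, `decoder`, and `augment` are hypothetical modules, and the temperature `tau`, loss weight `alpha`, and the L1 reconstruction are illustrative assumptions rather than values or choices reported in the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """NT-Xent contrastive loss over a batch of paired embeddings."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d)
    sim = z @ z.t() / tau                                # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # For row i, the positive is the other augmented view of the same utterance.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

def speech_simclr_step(batch, encoder, projection_head, decoder, augment, alpha=1.0):
    """One training step: contrastive agreement plus input reconstruction."""
    x1, x2 = augment(batch), augment(batch)              # two augmented views
    h1, h2 = encoder(x1), encoder(x2)
    z1, z2 = projection_head(h1), projection_head(h2)
    contrastive = nt_xent(z1, z2)
    # Reconstruct the input representation from each view; L1 is an
    # assumption here, not necessarily the paper's exact loss.
    recon = F.l1_loss(decoder(h1), batch) + F.l1_loss(decoder(h2), batch)
    return contrastive + alpha * recon
```

In this sketch the projection head feeds only the contrastive term, while reconstruction is computed from the encoder output, mirroring the usual SimCLR convention of discarding the projection head for downstream tasks.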
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction [13.5641621193917]
In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance.
Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production.
We introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements.
arXiv Detail & Related papers (2024-04-19T09:08:44Z)
- CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition [5.974778743092437]
CochCeps-Augment is a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations.
Our results position CochCeps-Augment to serve as a standard tool in speech emotion recognition analysis.
arXiv Detail & Related papers (2024-02-10T11:13:13Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract the target speech using visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate that the proposed method learns effective disentangled speech representations.
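As an illustration of the VQ content-encoding step mentioned above, here is a minimal, generic VQ-VAE-style bottleneck with a straight-through estimator in PyTorch; this is not VQMIVC's actual implementation, the codebook size, embedding dimension, and commitment weight are illustrative placeholders, and the paper's mutual-information term (which needs a separate MI estimator) is omitted.

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Generic VQ-VAE-style bottleneck with a straight-through estimator."""
    def __init__(self, num_codes=512, dim=64):  # illustrative sizes
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, h):                        # h: (batch, time, dim)
        flat = h.reshape(-1, h.size(-1))         # (batch*time, dim)
        # Squared Euclidean distance from each frame to every codebook vector.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        codes = dists.argmin(dim=1)              # nearest code per frame
        q = self.codebook(codes).view_as(h)      # quantized content embedding
        # Straight-through estimator: forward pass uses q, gradients flow to h.
        q_st = h + (q - h).detach()
        # Standard VQ-VAE codebook and commitment terms (0.25 is illustrative).
        vq_loss = ((q - h.detach()).pow(2).mean()
                   + 0.25 * (h - q.detach()).pow(2).mean())
        return q_st, vq_loss
```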
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z)
- Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.