Self-Supervised Representation Learning for Speech Using Visual
Grounding and Masked Language Modeling
- URL: http://arxiv.org/abs/2202.03543v1
- Date: Mon, 7 Feb 2022 22:09:54 GMT
- Title: Self-Supervised Representation Learning for Speech Using Visual
Grounding and Masked Language Modeling
- Authors: Puyuan Peng and David Harwath
- Abstract summary: FaST-VGS is a Transformer-based model that learns to associate raw speech waveforms with semantically related images.
FaST-VGS+ is learned in a multi-task fashion with a masked language modeling objective.
We show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task.
- Score: 13.956691231452336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge
and SUPERB benchmark. Our submissions are based on the recently proposed
FaST-VGS model, which is a Transformer-based model that learns to associate raw
speech waveforms with semantically related images, all without the use of any
transcriptions of the speech. Additionally, we introduce a novel extension of
this model, FaST-VGS+, which is learned in a multi-task fashion with a masked
language modeling objective in addition to the visual grounding objective. On
ZeroSpeech 2021, we show that our models perform competitively on the ABX task,
outperform all other concurrent submissions on the Syntactic and Semantic
tasks, and nearly match the best system on the Lexical task. On the SUPERB
benchmark, we show that our models also achieve strong performance, in some
cases even outperforming the popular wav2vec2.0 model.
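As a rough illustration of the multi-task setup described above, the sketch below combines a symmetric cross-modal contrastive loss (the visual grounding objective) with a standard masked language modeling loss. This is a minimal sketch, not the authors' implementation; the embedding shapes, loss weighting, and helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(speech_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired speech/image embeddings (B, D)."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature             # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = matched pairs
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def fast_vgs_plus_loss(speech_emb, image_emb, mlm_logits, mlm_labels, mlm_weight=1.0):
    """Multi-task objective: visual grounding + masked LM.
    mlm_labels marks unmasked positions with -100, the usual MLM convention."""
    grounding = info_nce(speech_emb, image_emb)
    mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                          mlm_labels.reshape(-1), ignore_index=-100)
    return grounding + mlm_weight * mlm
```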
Related papers
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409] (2024-02-08)
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets from a visually-grounded speech model into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
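A minimal sketch of the pseudo-target idea, assuming (this is not necessarily the paper's exact recipe) that word-like segment features from a visually-grounded model are quantized with k-means into discrete IDs that a HuBERT-style model then learns to predict:

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_word_targets(segment_features, n_clusters=500, seed=0):
    """Quantize pooled segment features (num_segments, dim) from a
    visually-grounded speech model into discrete pseudo-word IDs.
    The cluster count and pooling granularity are illustrative assumptions."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(segment_features)   # one pseudo-word ID per segment
```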
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316] (2024-01-24)
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
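A schematic reading of Chain-of-Information Generation: content is produced as semantic tokens first, then rendered into perceptual (acoustic) tokens. Both models below are hypothetical callables, not SpeechGPT-Gen's API.

```python
from typing import Callable, List

def chain_of_information_generate(
    prompt: List[int],
    semantic_lm: Callable[[List[int]], List[int]],     # hypothetical stage 1
    perceptual_lm: Callable[[List[int]], List[int]],   # hypothetical stage 2
) -> List[int]:
    semantic_tokens = semantic_lm(prompt)              # decide *what* is said
    acoustic_tokens = perceptual_lm(semantic_tokens)   # decide *how* it sounds
    return acoustic_tokens
```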
- Generative Pre-training for Speech with Flow Matching [81.59952572752248] (2023-10-25)
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
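A hedged sketch of one flow-matching training step on speech features: regress the velocity of a straight noise-to-data path, conditioned on a partially masked copy of the target. The frame-masking scheme and the model signature are assumptions, not SpeechFlow's exact design.

```python
import torch

def flow_matching_step(model, x1, mask_ratio=0.7):
    """x1: clean speech features (B, T, D). Returns the regression loss."""
    x0 = torch.randn_like(x1)                           # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # per-example time in [0, 1]
    xt = (1 - t) * x0 + t * x1                          # straight interpolation path
    target_velocity = x1 - x0                           # d(xt)/dt along that path
    keep = (torch.rand_like(x1[..., :1]) > mask_ratio).float()
    cond = x1 * keep                                    # ~mask_ratio of frames zeroed out
    pred = model(xt, t.view(-1), cond)                  # hypothetical model interface
    return ((pred - target_velocity) ** 2).mean()
```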
- Self-Supervised Models of Speech Infer Universal Articulatory Kinematics [44.27187669492598] (2023-10-16)
We show "inference of articulatory kinematics" as fundamental property of SSL models.
We also show that this abstraction is largely overlapping across the language of the data used to train the model.
We show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferrable across speakers, even across genders, languages, and dialects.
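The affine-transfer claim suggests a simple recipe: fit a least-squares affine map between two speakers' articulatory spaces and apply it to a shared probe's output. A sketch under that reading (array shapes and variable names are assumptions):

```python
import numpy as np

def fit_affine(source_art, target_art):
    """Least-squares affine map (W, b) with target ≈ source @ W + b.
    Both arrays are (frames, n_articulator_dims) trajectories."""
    X = np.hstack([source_art, np.ones((source_art.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, target_art, rcond=None)
    return coef[:-1], coef[-1]   # W: (dims, dims), b: (dims,)

# Usage: adapt AAI predictions from one speaker's space to another's.
# W, b = fit_affine(pred_for_speaker_a, pred_for_speaker_b)
# adapted = pred_for_speaker_a @ W + b
```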
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494] (2023-10-05)
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
- The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 [16.297790031478634] (2023-08-26)
We introduce DiffuseStyleGesture+, our solution for the Generation and Evaluation of Non-verbal Behavior for Embodied Agents (GENEA) Challenge 2023.
Our proposed model, DiffuseStyleGesture+, leverages a diffusion model to generate gestures automatically.
It incorporates a variety of modalities, including audio, text, speaker ID, and seed gestures.
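A generic DDPM-style training step conditioned on those modalities; the denoiser interface, linear noise schedule, and noise-prediction target below are assumptions rather than DiffuseStyleGesture+'s exact design.

```python
import torch

def gesture_diffusion_loss(denoiser, gestures, audio, text, speaker_id,
                           seed_gesture, num_steps=1000):
    """gestures: (B, T, D) target motion. Noise it to a random timestep and
    train the denoiser to recover the noise given the multimodal conditions."""
    B = gestures.size(0)
    t = torch.randint(0, num_steps, (B,), device=gestures.device)
    beta = torch.linspace(1e-4, 0.02, num_steps, device=gestures.device)
    alpha_bar = torch.cumprod(1 - beta, dim=0)[t].view(B, 1, 1)
    noise = torch.randn_like(gestures)
    noisy = alpha_bar.sqrt() * gestures + (1 - alpha_bar).sqrt() * noise
    pred = denoiser(noisy, t, audio, text, speaker_id, seed_gesture)  # hypothetical
    return ((pred - noise) ** 2).mean()
```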
- StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models [19.029030168939354] (2023-06-13)
StyleTTS 2 is a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset, as judged by native English speakers.
This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.
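One way to read "adversarial training with large SLMs" is a frozen SLM used as a feature extractor for a small trainable discriminator head. A sketch under that assumption (the SLM call and head are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

def slm_adversarial_losses(slm, disc_head, real_wav, fake_wav):
    """Non-saturating GAN losses over frozen-SLM features.
    slm: frozen feature extractor; disc_head: small trainable scorer."""
    with torch.no_grad():
        real_feat = slm(real_wav)       # no gradients needed for real audio
    fake_feat = slm(fake_wav)           # gradients flow back to the TTS generator
    d_loss = (F.softplus(-disc_head(real_feat)).mean()
              + F.softplus(disc_head(fake_feat.detach())).mean())
    g_loss = F.softplus(-disc_head(fake_feat)).mean()
    return d_loss, g_loss
```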
- Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model [21.286529902957724] (2023-05-19)
We show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian.
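A toy illustration of the kind of signal involved: hypothesize syllable boundaries at local minima of adjacent-frame cosine similarity in the model's features. The paper's actual segmentation procedure is more involved; this is only a sketch.

```python
import numpy as np

def toy_syllable_boundaries(frames, min_gap=3):
    """frames: (T, D) frame-level features. Returns candidate boundary indices
    at local minima of neighbouring-frame cosine similarity."""
    unit = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    sim = (unit[:-1] * unit[1:]).sum(axis=1)           # similarity between neighbours
    candidates = [t for t in range(1, len(sim) - 1)
                  if sim[t] < sim[t - 1] and sim[t] < sim[t + 1]]
    boundaries = []
    for b in candidates:                               # enforce minimum spacing
        if not boundaries or b - boundaries[-1] >= min_gap:
            boundaries.append(b)
    return boundaries
```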
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934] (2022-09-26)
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may also fail on certain types of datasets.
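The protocol implied here, loosely sketched: freeze the SSL model and train only a lightweight probe on pooled features for each (speech or non-speech) dataset. The interfaces below are assumptions.

```python
import torch
import torch.nn.functional as F

def probe_step(ssl_model, probe, optimizer, wav_batch, labels):
    """One training step of a linear probe over a frozen SSL model's
    mean-pooled frame features; only the probe's parameters are updated."""
    with torch.no_grad():
        feats = ssl_model(wav_batch).mean(dim=1)   # (B, D) pooled representation
    loss = F.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```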
- Fast-Slow Transformer for Visually Grounding Speech [15.68151998164009] (2021-09-16)
We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.
FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images.
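The fast/slow split suggests a two-stage retrieval loop: a cheap dual-encoder dot product shortlists images ("fast"), and a heavier cross-modal scorer re-ranks the shortlist ("slow"). A sketch with hypothetical components:

```python
import torch

def retrieve_fast_then_slow(speech_emb, image_embs, cross_scorer, images, k=10):
    """speech_emb: (D,) query; image_embs: (N, D) precomputed gallery.
    cross_scorer is a hypothetical cross-attention module returning one
    relevance score per (speech, image) pair."""
    coarse = image_embs @ speech_emb                 # fast: (N,) dot-product scores
    shortlist = coarse.topk(k).indices               # keep only the top-k candidates
    fine = torch.stack([cross_scorer(speech_emb, images[i]) for i in shortlist])
    return shortlist[fine.argmax()]                  # slow: best re-ranked image
```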
- SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203] (2021-05-03)
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV).
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
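That framework is commonly realized as a learnable softmax-weighted sum over the frozen upstream model's layer outputs feeding a small task head; the sketch below follows that pattern, with a linear classifier standing in for whatever head a given task needs.

```python
import torch
import torch.nn as nn

class LightweightHead(nn.Module):
    """SUPERB-style task head: weighted sum over frozen layer outputs,
    mean-pooled over time, then a small task-specific predictor."""
    def __init__(self, num_layers, dim, num_classes):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, hidden_states):                 # list of (B, T, D) tensors
        stacked = torch.stack(hidden_states, dim=0)   # (L, B, T, D)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        pooled = (w * stacked).sum(dim=0).mean(dim=1) # weighted sum, then time pool
        return self.classifier(pooled)
```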
This list is automatically generated from the titles and abstracts of the papers on this site.