Simultaneous or Sequential Training? How Speech Representations
Cooperate in a Multi-Task Self-Supervised Learning System
- URL: http://arxiv.org/abs/2306.02972v1
- Date: Mon, 5 Jun 2023 15:35:19 GMT
- Title: Simultaneous or Sequential Training? How Speech Representations
Cooperate in a Multi-Task Self-Supervised Learning System
- Authors: Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko
Räsänen
- Abstract summary: Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning.
We study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task learning system.
- Score: 12.704529528199064
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech representation learning with self-supervised algorithms has resulted
in notable performance boosts in many downstream tasks. Recent work combined
self-supervised learning (SSL) and visually grounded speech (VGS) processing
mechanisms for representation learning. The joint training with SSL and VGS
mechanisms provides the opportunity to utilize both unlabeled speech and
speech-related visual information based on data availability. This has been shown to
enhance the quality of learned representations, especially at encoding
semantic- and lexical-level knowledge. In this work, we further study the joint
optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task
learning system. We explore a set of training scenarios to understand how
speech representations are shared or transferred between the two tasks, and
what the optimal training strategy is for cross-modal semantic retrieval and
phoneme discrimination performance. As a result, we find that sequential
training with wav2vec 2.0 first and VGS next provides higher performance on
audio-visual retrieval compared to simultaneous optimization of both learning
mechanisms. However, the parallel SSL-VGS training reduces the effects of
catastrophic forgetting when switching between optimization criteria. Moreover,
the results suggest that phonemic representations learned through the VGS
mechanism may generalize better across datasets compared to those learned with
SSL.
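To make the compared training scenarios concrete, here is a minimal PyTorch-style sketch of how a wav2vec 2.0-style SSL criterion and a VGS cross-modal retrieval criterion could be combined in one multi-task loop, optimized either simultaneously (losses summed at each step) or sequentially (SSL epochs first, VGS epochs next). The module and loss names (SpeechEncoder, ssl_masked_loss, vgs_retrieval_loss) and the simplified objectives are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Stand-in for a wav2vec 2.0-style encoder: input frames -> contextual features."""
    def __init__(self, in_dim=40, hid=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid)
        layer = nn.TransformerEncoderLayer(d_model=hid, nhead=4, batch_first=True)
        self.ctx = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):                          # x: (B, T, in_dim)
        return self.ctx(self.proj(x))              # (B, T, hid)

def ssl_masked_loss(encoder, feats, mask_prob=0.5, temp=0.1):
    """Simplified stand-in for wav2vec 2.0's masked contrastive objective:
    at masked frames, pick out the true target frame among the other frames
    of the same utterance (which serve as negatives)."""
    targets = F.normalize(encoder(feats).detach(), dim=-1)            # (B, T, H)
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_prob
    preds = F.normalize(encoder(feats.masked_fill(mask.unsqueeze(-1), 0.0)), dim=-1)
    logits = torch.bmm(preds, targets.transpose(1, 2)) / temp         # (B, T, T)
    labels = torch.arange(feats.size(1), device=feats.device).expand(feats.size(0), -1)
    per_frame = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    return per_frame[mask].mean()

def vgs_retrieval_loss(encoder, image_embed, feats, images, temp=0.07):
    """Cross-modal InfoNCE: align pooled speech embeddings with their paired images."""
    a = F.normalize(encoder(feats).mean(dim=1), dim=-1)               # (B, H)
    v = F.normalize(image_embed(images), dim=-1)                      # (B, H)
    logits = a @ v.t() / temp                                         # (B, B)
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def train(encoder, image_embed, loader, strategy="simultaneous", epochs=2):
    """strategy='simultaneous': sum both losses at every step.
    strategy='sequential': SSL-only for the first half of the epochs, VGS-only after."""
    params = list(encoder.parameters()) + list(image_embed.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for epoch in range(epochs):
        for feats, images in loader:               # paired (speech features, image tensors)
            if strategy == "simultaneous":
                loss = (ssl_masked_loss(encoder, feats)
                        + vgs_retrieval_loss(encoder, image_embed, feats, images))
            else:                                  # sequential: wav2vec 2.0 first, VGS next
                if epoch < epochs // 2:
                    loss = ssl_masked_loss(encoder, feats)
                else:
                    loss = vgs_retrieval_loss(encoder, image_embed, feats, images)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Under these assumptions, strategy="sequential" mirrors the SSL-first-then-VGS schedule the abstract reports as best for audio-visual retrieval, while strategy="simultaneous" corresponds to the parallel optimization that reduces catastrophic forgetting when switching criteria.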
Related papers
- Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation [7.124066540020968]
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS) pursues semantic understanding of audio-visual scenes.
Previous methods have struggled to handle this mashup of objectives in end-to-end training, resulting in insufficient learning and sub-optimal results.
We propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simple subtasks, from localization to semantic understanding, each fully optimized in its own stage to achieve step-by-step global optimization.
arXiv Detail & Related papers (2024-07-16T15:08:30Z) - Sequential Contrastive Audio-Visual Learning [12.848371604063168]
We propose sequential contrastive audio-visual learning (SCAV), which contrasts examples based on their non-aggregated representation space using sequential distances.
Retrieval experiments with the VGGSound and Music datasets demonstrate the effectiveness of SCAV.
We also show that models trained with SCAV exhibit a high degree of flexibility regarding the metric employed for retrieval, allowing them to operate on a spectrum of efficiency-accuracy trade-offs.
arXiv Detail & Related papers (2024-07-08T09:45:20Z) - Improved Baselines for Data-efficient Perceptual Augmentation of LLMs [66.05826802808177]
In computer vision, large language models (LLMs) can be used to prime vision-language tasks such as image captioning and visual question answering.
We present an experimental evaluation of different interfacing mechanisms, across multiple tasks.
We identify a new interfacing mechanism that yields (near) optimal results across different tasks, while obtaining a 4x reduction in training time.
arXiv Detail & Related papers (2024-03-20T10:57:17Z) - Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation.
We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv Detail & Related papers (2024-02-23T07:21:32Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Deploying self-supervised learning in the wild for hybrid automatic
speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z) - Learning Decoupling Features Through Orthogonality Regularization [55.79910376189138]
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously (a sketch of the orthogonality idea follows this list).
arXiv Detail & Related papers (2022-03-31T03:18:13Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
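For the "Learning Decoupling Features Through Orthogonality Regularization" entry above, a hedged sketch of the orthogonality idea: two branches with the same structure produce a KWS embedding and an SV embedding, and a regularizer pushes the two embeddings toward orthogonality so that task-specific features decouple. The squared-cosine penalty below is an assumption for illustration, not necessarily that paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(kws_emb: torch.Tensor, sv_emb: torch.Tensor) -> torch.Tensor:
    """Penalize overlap between per-utterance KWS and SV embeddings of shape (B, D)."""
    kws = F.normalize(kws_emb, dim=-1)
    sv = F.normalize(sv_emb, dim=-1)
    return (kws * sv).sum(dim=-1).pow(2).mean()   # zero when the two embeddings are orthogonal

# Hypothetical usage inside a joint objective:
# total_loss = kws_task_loss + sv_task_loss + lambda_orth * orthogonality_penalty(kws_emb, sv_emb)
```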