Sound and Visual Representation Learning with Multiple Pretraining Tasks
- URL: http://arxiv.org/abs/2201.01046v1
- Date: Tue, 4 Jan 2022 09:09:38 GMT
- Title: Sound and Visual Representation Learning with Multiple Pretraining Tasks
- Authors: Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
- Abstract summary: Different self-supervised learning (SSL) tasks reveal different features from the data.
This work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well across downstream tasks.
Experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single-SSL-task models.
- Score: 104.11800812671953
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Different self-supervised learning (SSL) tasks reveal different features from the
data, and the learned feature representations can therefore perform differently on each
downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL)
into a representation that generalizes well across all downstream tasks. Specifically, for
this study, we investigate binaural sounds and image data in isolation. For binaural sounds,
we propose three SSL tasks, namely spatial alignment, temporal synchronization of foreground
objects and binaural audio, and temporal gap prediction. We investigate several approaches
to Multi-SSL and give insights into downstream task performance on video retrieval, spatial
sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments
on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL)
of SSL tasks outperforms single-SSL-task models and fully supervised models in downstream
task performance. As a check of applicability to another modality, we also formulate our
Multi-SSL models for image representation learning, using the recently proposed SSL tasks
MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL and
DetCo by 2.06%, 3.27% and 1.19% on VOC07 classification and by +2.83, +1.56 and +1.61 AP
on COCO detection. Code will be made publicly available.
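As a rough illustration of the Multi-SSL-via-incremental-learning idea described in the abstract, the sketch below trains a shared audio encoder on three toy SSL objectives one after another, so each new task starts from the weights shaped by the previous ones. This is not the authors' code: the encoder architecture, task heads, losses, and data shapes are illustrative assumptions, and the three task names only echo the tasks named in the abstract.

```python
# Minimal sketch (assumptions only) of Multi-SSL via incremental learning (IL):
# one shared encoder, one lightweight head per SSL task, tasks trained in sequence.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Toy stand-in for a binaural-audio encoder (e.g. a small spectrogram CNN)."""
    def __init__(self, in_ch=2, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
    def forward(self, x):
        return self.net(x)

def make_heads(dim=128):
    # Task names follow the abstract; each objective is reduced to a toy label here.
    return nn.ModuleDict({
        "spatial_alignment": nn.Linear(dim, 2),  # aligned vs. misaligned binaural channels
        "temporal_sync": nn.Linear(dim, 2),      # foreground object and audio in sync vs. shifted
        "temporal_gap": nn.Linear(dim, 1),       # regress the length of an inserted gap
    })

def train_task_incrementally(encoder, head, batches, lr=1e-3, is_regression=False):
    """Train encoder+head on one SSL task; the encoder keeps its weights for the next task."""
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    loss_fn = nn.MSELoss() if is_regression else nn.CrossEntropyLoss()
    for x, y in batches:
        opt.zero_grad()
        loss = loss_fn(head(encoder(x)), y)
        loss.backward()
        opt.step()

if __name__ == "__main__":
    enc, heads = SharedEncoder(), make_heads()
    # Fake batches: 2-channel "spectrograms" with self-derived pseudo-labels.
    def fake(regression):
        return [(torch.randn(8, 2, 64, 64),
                 torch.randn(8, 1) if regression else torch.randint(0, 2, (8,)))
                for _ in range(4)]
    # Incremental learning: SSL tasks are visited one after another on the same encoder.
    train_task_incrementally(enc, heads["spatial_alignment"], fake(False))
    train_task_incrementally(enc, heads["temporal_sync"], fake(False))
    train_task_incrementally(enc, heads["temporal_gap"], fake(True), is_regression=True)
    print("encoder now carries features shaped by all three SSL tasks")
```

The same pattern would carry over to the image experiments by swapping in an image encoder and MoCov2/DenseCL-style objectives as the per-task heads; the abstract does not specify those implementation details, so this remains a schematic reading.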
Related papers
- On the Discriminability of Self-Supervised Representation Learning [38.598160031349686]
Self-supervised learning (SSL) has recently achieved significant success in downstream visual tasks.
A notable gap still exists between SSL and supervised learning (SL), especially in complex downstream tasks.
arXiv Detail & Related papers (2024-07-18T14:18:03Z)
- Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect [11.013934239276036]
Speech encoders pretrained through self-supervised learning (SSL) have demonstrated remarkable performance in various downstream tasks.
This paper contributes by comparing the effectiveness of SSL approaches in the context of the low-resource spoken Tunisian Arabic dialect.
arXiv Detail & Related papers (2024-07-05T14:21:36Z)
- Every Node is Different: Dynamically Fusing Self-Supervised Tasks for Attributed Graph Clustering [59.45743537594695]
We propose Dynamically Fusing Self-Supervised Learning (DyFSS) for graph clustering.
DyFSS fuses features extracted from diverse SSL tasks using distinct weights derived from a gating network (a minimal gated-fusion sketch appears after this list).
Experiments show DyFSS outperforms state-of-the-art multi-task SSL methods by up to 8.66% on the accuracy metric.
arXiv Detail & Related papers (2024-01-12T14:24:10Z)
- Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z)
- What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model [30.88357561791563]
This study is focused on understanding and quantifying the change in phoneme and prosody information encoded in the self-supervised learning model.
Results show that the AID fine-tuning task steers the top 2 layers to learn richer phoneme and prosody representations.
arXiv Detail & Related papers (2023-06-10T21:20:47Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new self-supervised learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech [15.975756437343742]
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech.
We conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA).
Our findings suggest that SSL models learn to align closely with continuous articulations, providing a novel insight into speech SSL.
arXiv Detail & Related papers (2022-10-21T04:24:29Z)
- Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation [27.857955394020475]
Self-supervised learning (SSL) models have been successfully applied in various deep learning-based speech tasks.
The quality of SSL representations depends highly on the relatedness between the SSL training domain(s) and the target data domain.
We propose a learnable and interpretable framework to combine spectral features (SF) and SSL representations.
arXiv Detail & Related papers (2022-04-05T20:09:15Z)
- DATA: Domain-Aware and Task-Aware Pre-training [94.62676913928831]
We present DATA, a simple yet effective NAS approach specialized for self-supervised learning (SSL).
Our method achieves promising results across a wide range of computation costs on downstream tasks, including image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2022-03-17T02:38:49Z)
- Audio Self-supervised Learning: A Survey [60.41768569891083]
Self-supervised learning (SSL) aims to discover general representations from large-scale data without requiring human annotations.
Its success in the fields of computer vision and natural language processing has prompted its recent adoption into the field of audio and speech processing.
arXiv Detail & Related papers (2022-03-02T15:58:29Z)
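For the gating-network fusion mentioned in the DyFSS entry above, the following is a minimal, assumption-level sketch of how per-task SSL features can be combined with learned weights. It is not the DyFSS implementation: the dimensions, the softmax gate, and the class name are invented for illustration.

```python
# Hypothetical gated fusion of features produced by several SSL tasks.
import torch
import torch.nn as nn

class GatedSSLFusion(nn.Module):
    def __init__(self, num_tasks=3, dim=128):
        super().__init__()
        # The gate sees the concatenated per-task features and emits one weight per task.
        self.gate = nn.Sequential(nn.Linear(num_tasks * dim, num_tasks), nn.Softmax(dim=-1))

    def forward(self, task_feats):
        # task_feats: list of per-task feature tensors, each of shape (batch, dim)
        stacked = torch.stack(task_feats, dim=1)             # (batch, num_tasks, dim)
        weights = self.gate(torch.cat(task_feats, dim=-1))   # (batch, num_tasks)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (batch, dim) fused feature

if __name__ == "__main__":
    feats = [torch.randn(4, 128) for _ in range(3)]  # features from three SSL tasks
    fused = GatedSSLFusion()(feats)
    print(fused.shape)  # torch.Size([4, 128])
```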