Speech representation learning: Learning bidirectional encoders with
single-view, multi-view, and multi-task methods
- URL: http://arxiv.org/abs/2308.00129v1
- Date: Tue, 25 Jul 2023 20:38:55 GMT
- Title: Speech representation learning: Learning bidirectional encoders with
single-view, multi-view, and multi-task methods
- Authors: Qingming Tang
- Abstract summary: This thesis focuses on representation learning for sequence data over time or space.
It aims to improve downstream sequence prediction tasks by using the learned representations.
- Score: 7.1345443932276424
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This thesis focuses on representation learning for sequence data over time or
space, aiming to improve downstream sequence prediction tasks by using the
learned representations. Supervised learning has been the most dominant
approach for training deep neural networks for learning good sequential
representations. However, one factor limiting the scaling of supervised learning
is the lack of sufficient annotated data. Motivated by this challenge, it is natural
to explore representation learning methods that can utilize large amounts of
unlabeled and weakly labeled data, as well as an additional data modality. I
describe my broad study of representation learning for speech data. Unlike most
other works that focus on a single learning setting, this thesis studies
multiple settings: supervised learning with auxiliary losses, unsupervised
learning, semi-supervised learning, and multi-view learning. Besides different
learning problems, I also explore multiple approaches for representation
learning. Though I focus on speech data, the methods described in this thesis
can also be applied to other domains. Overall, the field of representation
learning is developing rapidly. State-of-the-art results on speech related
tasks are typically based on Transformers pre-trained with large-scale
self-supervised learning, which aims to learn generic representations that can
benefit multiple downstream tasks. Since 2020, large-scale pre-training has
been the de facto choice for achieving good performance. This thesis, completed
after that shift, does not attempt to summarize or compare against the latest
results on speech representation learning; instead, it presents a study of speech
representation learning before the Transformer era, one that covers multiple
learning settings. Some of the findings in this thesis can still be useful
today.
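To make the "supervised learning with auxiliary losses" setting above concrete, here is a minimal sketch in PyTorch. It is not taken from the thesis; the bidirectional encoder, the two heads, the dimensions, and the loss weight are all illustrative assumptions.

```python
# Minimal sketch: a bidirectional recurrent encoder trained with a supervised
# frame-classification loss plus an auxiliary reconstruction loss. All names
# and dimensions are illustrative, not the thesis's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiEncoder(nn.Module):
    def __init__(self, n_feats=40, n_hidden=256, n_classes=48):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, n_hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * n_hidden, n_classes)   # supervised head
        self.reconstruct = nn.Linear(2 * n_hidden, n_feats)    # auxiliary head

    def forward(self, x):
        h, _ = self.rnn(x)                  # (batch, time, 2 * n_hidden)
        return self.classifier(h), self.reconstruct(h)

def multi_task_loss(model, feats, labels, aux_weight=0.1):
    logits, recon = model(feats)
    ce = F.cross_entropy(logits.transpose(1, 2), labels)  # per-frame labels
    aux = F.mse_loss(recon, feats)                        # reconstruct the input
    return ce + aux_weight * aux

model = BiEncoder()
feats = torch.randn(8, 100, 40)            # batch of 100-frame feature sequences
labels = torch.randint(0, 48, (8, 100))    # per-frame labels (e.g. phonetic)
loss = multi_task_loss(model, feats, labels)
loss.backward()
```

The auxiliary reconstruction term acts as an unsupervised regularizer on the shared encoder, which is the general shape of the multi-task methods named in the title.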
Related papers
- Learning Transferable Pedestrian Representation from Multimodal
Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition, and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- The Trade-off between Universality and Label Efficiency of
Representations from Contrastive Learning [32.15608637930748]
We show that there is a trade-off between the two desiderata (universality and label efficiency), so one may not be able to achieve both simultaneously.
We provide analysis using a theoretical data model and show that, while more diverse pre-training data yield more diverse features for different tasks, they put less emphasis on task-specific features.
arXiv Detail & Related papers (2023-02-28T22:14:33Z)
- Brief Introduction to Contrastive Learning Pretext Tasks for Visual
Representation [0.0]
We introduce contrastive learning, a subset of unsupervised learning methods.
The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing apart the embeddings of different samples (a minimal sketch of such an objective appears after this list).
We survey recently published contrastive learning strategies that focus on pretext tasks for visual representation.
arXiv Detail & Related papers (2022-10-06T18:54:10Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Learning Downstream Task by Selectively Capturing Complementary
Knowledge from Multiple Self-supervisedly Learning Pretexts [20.764378638979704]
We propose a novel solution that leverages the attention mechanism to adaptively squeeze suitable representations for downstream tasks.
Our scheme significantly outperforms popular pretext-matching methods at gathering knowledge.
arXiv Detail & Related papers (2022-04-11T16:46:50Z)
- X-Learner: Learning Cross Sources and Tasks for Universal Visual
Representation [71.51719469058666]
We propose a representation learning framework called X-Learner.
X-Learner learns universal features for multiple vision tasks supervised by various sources.
X-Learner achieves strong performance on different tasks without extra annotations, modalities, or computational cost.
arXiv Detail & Related papers (2022-03-16T17:23:26Z)
- Playful Interactions for Representation Learning [82.59215739257104]
We propose to use playful interactions in a self-supervised manner to learn visual representations for downstream tasks.
We collect 2 hours of playful data in 19 diverse environments and use self-predictive learning to extract visual representations.
Our representations generalize better than standard behavior cloning and can achieve similar performance with only half the number of required demonstrations.
arXiv Detail & Related papers (2021-07-19T17:54:48Z)
- An analysis on the use of autoencoders for representation learning:
fundamentals, learning task case studies, explainability and challenges [11.329636084818778]
In many machine learning tasks, learning a good representation of the data can be the key to building a well-performing solution.
We present a series of learning tasks: data embedding for visualization, image denoising, semantic hashing, detection of abnormal behaviors and instance generation.
A solution is proposed for each task employing autoencoders as the only learning method (a generic denoising-autoencoder sketch appears after this list).
arXiv Detail & Related papers (2020-05-21T08:41:57Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes a model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps (a minimal meta-train step is sketched after this list).
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
- Laplacian Denoising Autoencoder [114.21219514831343]
We propose to learn data representations with a novel type of denoising autoencoder.
The noisy input data is generated by corrupting latent clean data in the gradient domain.
Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach.
arXiv Detail & Related papers (2020-03-30T16:52:39Z)
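As referenced in the "Brief Introduction to Contrastive Learning Pretext Tasks" entry above, here is a minimal InfoNCE-style sketch of a contrastive objective in PyTorch: augmented views of the same sample are pulled together, and every other sample in the batch is pushed apart. The function name, temperature, and dimensions are illustrative assumptions, not any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive loss between two batches of embeddings, where z1[i] and
    z2[i] are two augmented views of the same sample. Matching rows are
    positives; every other row in the batch acts as a negative."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)   # diagonal entries are positives

z1 = torch.randn(32, 128)   # embeddings of view 1
z2 = torch.randn(32, 128)   # embeddings of view 2 (same samples, different augmentation)
loss = info_nce_loss(z1, z2)
```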
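For the two autoencoder entries above, this is a generic denoising-autoencoder sketch in PyTorch: corrupt the input, then train to reconstruct the clean version. It uses plain additive Gaussian noise on the input; the Laplacian Denoising Autoencoder paper instead corrupts latent clean data in the gradient domain, which is not reproduced here.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Generic denoising autoencoder: reconstruct the clean input from a
    corrupted copy, so the bottleneck must capture robust structure."""
    def __init__(self, n_in=784, n_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x, noise_std=0.3):
        x_noisy = x + noise_std * torch.randn_like(x)  # additive Gaussian corruption
        return self.decoder(self.encoder(x_noisy))

model = DenoisingAutoencoder()
x = torch.rand(16, 784)                      # e.g. flattened images
loss = nn.functional.mse_loss(model(x), x)   # target is the clean input
loss.backward()
```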
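And for the "Pre-training Text Representations as Meta Learning" entry, a minimal sketch of one MAML-style meta-train step: adapt on a support batch with an inner gradient step, then evaluate the adapted weights on a query batch. It assumes PyTorch 2.x for torch.func.functional_call; the model, losses, and data are illustrative, and the paper's actual algorithm may differ.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def meta_train_step(model, support, query, inner_lr=0.01):
    """One MAML-style step: take an inner gradient step on the support batch,
    then compute the outer loss with the adapted weights on the query batch.
    Skipping the inner step reduces this to plain multi-task training."""
    params = dict(model.named_parameters())
    xs, ys = support
    inner_loss = F.mse_loss(functional_call(model, params, (xs,)), ys)
    grads = torch.autograd.grad(inner_loss, tuple(params.values()),
                                create_graph=True)
    # Adapted ("fast") weights after one inner gradient step.
    fast = {name: p - inner_lr * g
            for (name, p), g in zip(params.items(), grads)}
    xq, yq = query
    return F.mse_loss(functional_call(model, fast, (xq,)), yq)

model = torch.nn.Linear(10, 1)
support = (torch.randn(16, 10), torch.randn(16, 1))
query = (torch.randn(16, 10), torch.randn(16, 1))
outer_loss = meta_train_step(model, support, query)
outer_loss.backward()   # gradients flow back through the inner step
```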
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.