data2vec: A General Framework for Self-supervised Learning in Speech,
Vision and Language
- URL: http://arxiv.org/abs/2202.03555v1
- Date: Mon, 7 Feb 2022 22:52:11 GMT
- Title: data2vec: A General Framework for Self-supervised Learning in Speech,
Vision and Language
- Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu,
Michael Auli
- Abstract summary: data2vec is a framework that uses the same learning method for either speech, NLP or computer vision.
The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup.
Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance.
- Score: 85.9019051663368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the general idea of self-supervised learning is identical across
modalities, the actual algorithms and objectives differ widely because they
were developed with a single modality in mind. To get us closer to general
self-supervised learning, we present data2vec, a framework that uses the same
learning method for either speech, NLP or computer vision. The core idea is to
predict latent representations of the full input data based on a masked view of
the input in a self-distillation setup using a standard Transformer
architecture. Instead of predicting modality-specific targets such as words,
visual tokens or units of human speech which are local in nature, data2vec
predicts contextualized latent representations that contain information from
the entire input. Experiments on the major benchmarks of speech recognition,
image classification, and natural language understanding demonstrate a new
state of the art or competitive performance to predominant approaches.
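Below is a minimal PyTorch-style sketch of the training step described above, assuming a student and an EMA-updated teacher that share a Transformer architecture: the student encodes a masked view of the input, while the teacher builds contextualized targets from the unmasked input by averaging its top-K block outputs. The encoder interface (`return_all_layers`, the `mask` keyword), the `sample_mask` helper, and the hyperparameter values are illustrative assumptions, not the released implementation.

```python
import copy

import torch
import torch.nn.functional as F


def ema_update(teacher, student, tau=0.999):
    # The teacher's parameters track an exponential moving average of the student's.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)


def data2vec_step(student, teacher, x, mask, top_k=8, beta=2.0):
    """One self-distillation step: the student sees a masked view of the input
    and regresses contextualized targets built by the teacher from the full input."""
    with torch.no_grad():
        # Assumed teacher interface: returns the output of every Transformer block.
        layer_outputs = teacher(x, return_all_layers=True)        # list of (B, T, D)
        target = torch.stack(layer_outputs[-top_k:]).mean(dim=0)  # average the top-K blocks
        target = F.layer_norm(target, target.shape[-1:])          # simple target normalization

    # Assumed student interface: the mask blanks the selected time steps of the input.
    pred = student(x, mask=mask)                                   # (B, T, D)
    # Regress the teacher's representations only at the masked positions.
    return F.smooth_l1_loss(pred[mask], target[mask], beta=beta)


# Sketch of the surrounding loop: the teacher starts as a copy of the student
# and is updated only by EMA, never by gradients.
#   teacher = copy.deepcopy(student)
#   for x in loader:
#       mask = sample_mask(x)                     # hypothetical masking helper
#       loss = data2vec_step(student, teacher, x, mask)
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
#       ema_update(teacher, student)
```

Because the targets are built from the teacher's hidden states over the whole input, they are contextualized rather than local, modality-specific tokens such as words or visual tokens.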
Related papers
- Learning Semantic Information from Raw Audio Signal Using Both Contextual and Phonetic Representations [18.251845041785906]
We propose a framework to learn semantics from raw audio signals using two types of representations.
We introduce a speech-to-unit processing pipeline that captures two types of representations with different time resolutions.
For the language model, we adopt a dual-channel architecture to incorporate both types of representation.
arXiv Detail & Related papers (2024-02-02T10:39:58Z)
- PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition [47.11517266162346]
We propose a Prompt-driven Visual-Linguistic Representation Learning framework to better leverage the capabilities of the linguistic modality.
In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features.
arXiv Detail & Related papers (2024-01-31T14:39:11Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework that combines self-supervised representation learning with language label information during pre-training.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective to incorporate language labels alongside the self-supervised loss (see the illustrative sketch after this list).
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model that encodes speech into rich bidirectional encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling the linguistic content and speaker identity factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
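The Label Aware Speech Representation (LASR) entry above pairs a self-supervised loss with a triplet objective over language labels. Below is a generic sketch of one way the two terms could be combined, assuming mean-pooled utterance embeddings; the function name, the weighting `lam`, and the margin are hypothetical, and the actual LASR formulation may differ.

```python
import torch
import torch.nn.functional as F


def lasr_style_loss(anchor, positive, negative, loss_ssl, lam=1.0, margin=0.2):
    """Combine a precomputed self-supervised loss with a triplet term over
    language labels: anchor and positive are utterance embeddings from the
    same language, negative comes from a different language."""
    loss_triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return loss_ssl + lam * loss_triplet


# Toy usage with random utterance embeddings (batch of 4, dimension 256)
# and a placeholder value standing in for the self-supervised loss.
emb = lambda: torch.randn(4, 256)
loss = lasr_style_loss(emb(), emb(), emb(), loss_ssl=torch.tensor(0.5))
```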