Efficient Self-supervised Learning with Contextualized Target
Representations for Vision, Speech and Language
- URL: http://arxiv.org/abs/2212.07525v2
- Date: Thu, 15 Jun 2023 15:19:22 GMT
- Title: Efficient Self-supervised Learning with Contextualized Target
Representations for Vision, Speech and Language
- Authors: Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
- Abstract summary: data2vec is a learning objective that generalizes across several modalities.
We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations.
Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time.
- Score: 60.12197397018094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current self-supervised learning algorithms are often modality-specific and
require large amounts of computational resources. To address these issues, we
increase the training efficiency of data2vec, a learning objective that
generalizes across several modalities. We do not encode masked tokens, use a
fast convolutional decoder and amortize the effort to build teacher
representations. data2vec 2.0 benefits from the rich contextualized target
representations introduced in data2vec which enable a fast self-supervised
learner. Experiments on ImageNet-1K image classification show that data2vec 2.0
matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time,
on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x
less time, and on GLUE natural language understanding it matches a retrained
RoBERTa model in half the time. Trading some speed for accuracy results in
ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
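The efficiency gains described above come from three ingredients: the student encoder processes only the visible (unmasked) tokens, a lightweight convolutional decoder reconstructs representations at the masked positions, and the teacher's contextualized targets are built once per sample and reused across several masked versions of it. Below is a minimal, hedged sketch of such a training step, assuming PyTorch; the toy Transformer encoder, two-layer Conv1d decoder, top-2 layer averaging, mask ratio, and number of masks per sample are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a data2vec 2.0-style training step (not the authors' code).
# Assumptions not taken from the abstract: a toy Transformer encoder, a two-layer
# Conv1d decoder, top-2 layer averaging for targets, mask ratio 0.6, 4 masks/sample.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Toy Transformer encoder over patch/frame/token embeddings; returns all block outputs."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
             for _ in range(depth)]
        )

    def forward(self, x):                      # x: (batch, seq, dim)
        outs = []
        for blk in self.blocks:
            x = blk(x)
            outs.append(x)
        return outs


def ema_update(teacher, student, decay=0.999):
    """Keep the teacher as an exponential moving average of the student."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(decay).add_(ps, alpha=1 - decay)


def train_step(student, teacher, decoder, mask_emb, x, mask_ratio=0.6, num_masks=4):
    B, T, D = x.shape
    rows = torch.arange(B).unsqueeze(1)        # helper for per-sample indexing

    # 1) Teacher targets: contextualized representations of the FULL input,
    #    built once per sample and reused for every masked version (amortization).
    with torch.no_grad():
        layers = teacher(x)
        targets = torch.stack(layers[-2:]).mean(0)           # average top blocks
        targets = F.layer_norm(targets, targets.shape[-1:])  # simplified target normalization

    n_keep = int(T * (1 - mask_ratio))
    loss = 0.0
    for _ in range(num_masks):
        # 2) Sample a random mask; the student encodes ONLY the visible tokens
        #    (masked tokens are never fed to the encoder).
        keep = torch.rand(B, T, device=x.device).argsort(1)[:, :n_keep]
        visible = x[rows, keep]                               # (B, n_keep, D)
        enc = student(visible)[-1]

        # 3) Rebuild a full-length sequence: learned mask embedding everywhere,
        #    student outputs scattered back into the visible slots.
        full = mask_emb.expand(B, T, -1).clone()
        full[rows, keep] = enc

        # 4) A lightweight convolutional decoder predicts the teacher targets.
        pred = decoder(full.transpose(1, 2)).transpose(1, 2)  # (B, T, D)

        # 5) Regress the targets only at masked positions.
        masked = torch.ones(B, T, dtype=torch.bool, device=x.device)
        masked[rows, keep] = False
        loss = loss + F.mse_loss(pred[masked], targets[masked])

    return loss / num_masks


if __name__ == "__main__":
    dim = 256
    student = Encoder(dim)
    teacher = copy.deepcopy(student).requires_grad_(False)
    decoder = nn.Sequential(
        nn.Conv1d(dim, dim, kernel_size=5, padding=2),
        nn.GELU(),
        nn.Conv1d(dim, dim, kernel_size=5, padding=2),
    )
    mask_emb = nn.Parameter(torch.zeros(1, 1, dim))
    opt = torch.optim.AdamW(
        list(student.parameters()) + list(decoder.parameters()) + [mask_emb], lr=1e-4
    )

    x = torch.randn(2, 196, dim)               # e.g. two images as 14x14 patch embeddings
    loss = train_step(student, teacher, decoder, mask_emb, x)
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)
    print(f"loss: {loss.item():.4f}")
```

In this sketch the teacher forward pass runs once per sample while the student and decoder run `num_masks` times, which is where the amortization savings come from; the EMA update keeps the teacher's contextualized targets in step with the student.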
Related papers
- TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens [9.453667770656644]
We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens.
We enhance the visual encoder through LVLM co-training, unlocking its potential for previously unseen tasks like Chinese OCR and grounding.
We assess TextHawk2 across multiple benchmarks, where it consistently delivers superior performance and outperforms closed-source models of similar scale.
arXiv Detail & Related papers (2024-10-07T17:58:35Z)
- Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens [87.52235889917223]
We set the output of the proposed Im2Sp model to discretized speech units, i.e., the quantized speech features of a self-supervised speech model.
With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases.
arXiv Detail & Related papers (2023-09-15T16:48:34Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels produced by a random-projection quantizer (see the sketch after this list).
It achieves word error rates similar to previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Wav2vec-C: A Self-supervised Model for Speech Representation Learning [40.47940210640496]
Wav2vec-C is a representation learning technique combining elements from wav2vec 2.0 and VQ-VAE.
The proposed self-supervised model is trained on 10k hours of unlabeled data and fine-tuned with 1k hours of labeled data.
arXiv Detail & Related papers (2021-03-09T16:44:45Z)
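As referenced in the Self-supervised Learning with Random-projection Quantizer entry above, that approach predicts discrete labels produced by a frozen random-projection quantizer. The following is a minimal, hedged sketch of such a quantizer in PyTorch; the feature dimension, codebook size, and cosine nearest-neighbour lookup are illustrative assumptions, not details quoted from that paper.

```python
# Minimal sketch of a random-projection quantizer, in the spirit of the
# "Self-supervised Learning with Random-projection Quantizer" entry above.
# Dimensions, codebook size, and the cosine nearest-neighbour lookup are
# illustrative assumptions, not details quoted from that paper.
import torch
import torch.nn.functional as F


class RandomProjectionQuantizer(torch.nn.Module):
    """Maps speech feature frames to discrete labels using frozen random weights."""

    def __init__(self, feat_dim=80, code_dim=16, codebook_size=1024, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # Both the projection and the codebook are randomly initialized and never
        # trained; only the integer indices they produce are used as targets.
        self.register_buffer("proj", torch.randn(feat_dim, code_dim, generator=g))
        self.register_buffer("codebook", torch.randn(codebook_size, code_dim, generator=g))

    @torch.no_grad()
    def forward(self, feats):                       # feats: (batch, time, feat_dim)
        z = F.normalize(feats @ self.proj, dim=-1)  # project, then l2-normalize
        cb = F.normalize(self.codebook, dim=-1)
        # Nearest codebook vector per frame (cosine similarity) -> one label per frame,
        # which a masked-prediction model can then be trained to predict.
        return (z @ cb.t()).argmax(dim=-1)          # (batch, time) integer labels


# Example: discrete targets for two 100-frame utterances of 80-dim log-mel features.
quantizer = RandomProjectionQuantizer()
labels = quantizer(torch.randn(2, 100, 80))
print(labels.shape)  # torch.Size([2, 100])
```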