How Well Self-Supervised Pre-Training Performs with Streaming Data?
- URL: http://arxiv.org/abs/2104.12081v1
- Date: Sun, 25 Apr 2021 06:56:48 GMT
- Title: How Well Self-Supervised Pre-Training Performs with Streaming Data?
- Authors: Dapeng Hu, Qizhengqiu Lu, Lanqing Hong, Hailin Hu, Yifan Zhang,
Zhenguo Li, Alfred Shen, Jiashi Feng
- Abstract summary: In real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming.
It is unclear how well sequential self-supervised pre-training performs with streaming data.
We find that sequential self-supervised learning exhibits almost the same performance as joint training when the distribution shifts within streaming data are mild.
- Score: 73.5362286533602
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The common self-supervised pre-training practice requires collecting massive
unlabeled data together and then training a representation model, a scheme dubbed
**joint training**. However, in real-world scenarios where data are
collected in a streaming fashion, the joint training scheme is usually
storage-heavy and time-consuming. A more efficient alternative is to train a
model continually with streaming data, dubbed **sequential training**.
Nevertheless, it is unclear how well sequential self-supervised pre-training
performs with streaming data. In this paper, we conduct thorough experiments to
investigate self-supervised pre-training with streaming data. Specifically, we
evaluate the transfer performance of sequential self-supervised pre-training
with four different data sequences on three different downstream tasks and make
comparisons with joint self-supervised pre-training. Surprisingly, we find
sequential self-supervised learning exhibits almost the same performance as
joint training when the distribution shifts within streaming data are mild.
Even for data sequences with large distribution shifts, sequential
self-supervised training with simple techniques, e.g., parameter regularization
or data replay, still performs comparably to joint training. Based on our
findings, we recommend using sequential self-supervised training as a
**more efficient yet performance-competitive** representation learning
practice for real-world applications.
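The abstract names parameter regularization and data replay as the simple techniques that keep sequential pre-training competitive under large distribution shifts. The sketch below is a minimal, hypothetical illustration of that recipe in PyTorch: a toy encoder is trained self-supervised on data chunks arriving in sequence, with an L2 penalty toward the previously learned weights and a small replay buffer of past samples. The encoder, the noise-based augmentation, the chunk sizes, and all hyper-parameters are placeholders, not the paper's actual setup.
```python
# Minimal sketch of sequential self-supervised pre-training on streaming
# chunks, combining the two "simple techniques" mentioned in the abstract:
# an L2 parameter-regularization term toward the previous chunk's weights
# and a small replay buffer of past samples. Everything here is illustrative.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


def two_views(x):
    """Create two augmented 'views' of a batch (toy: additive Gaussian noise)."""
    return x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)


def ssl_loss(encoder, x):
    """Toy contrastive-style objective: pull the two views of each sample together."""
    v1, v2 = two_views(x)
    z1, z2 = F.normalize(encoder(v1), dim=1), F.normalize(encoder(v2), dim=1)
    return -(z1 * z2).sum(dim=1).mean()          # negative cosine similarity


encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.05)

replay_buffer = []                               # a few samples from earlier chunks
REPLAY_SIZE, LAMBDA_REG = 512, 1e-3

# Streaming data arrives chunk by chunk (random tensors stand in for real data).
chunks = [torch.randn(2048, 128) for _ in range(4)]

prev_params = None
for chunk in chunks:
    for step in range(50):                       # a few updates per chunk
        batch = chunk[torch.randint(len(chunk), (64,))]
        if replay_buffer:                        # mix in replayed old samples
            old = torch.stack(random.sample(replay_buffer, min(16, len(replay_buffer))))
            batch = torch.cat([batch, old])
        loss = ssl_loss(encoder, batch)
        if prev_params is not None:              # regularize toward the weights
            reg = sum((p - q).pow(2).sum()       # learned on earlier chunks
                      for p, q in zip(encoder.parameters(), prev_params))
            loss = loss + LAMBDA_REG * reg
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # snapshot weights and refresh the replay buffer before the next chunk
    prev_params = [p.detach().clone() for p in encoder.parameters()]
    replay_buffer.extend(chunk[torch.randperm(len(chunk))[:128]])
    replay_buffer = replay_buffer[-REPLAY_SIZE:]
```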
Related papers
- Refining Pre-Trained Motion Models [56.18044168821188]
We take on the challenge of improving state-of-the-art supervised models with self-supervised training.
We focus on obtaining a "clean" training signal from real-world unlabelled video.
We show that our method yields reliable gains over fully-supervised methods in real videos.
arXiv Detail & Related papers (2024-01-01T18:59:33Z)
- Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
- SOTASTREAM: A Streaming Approach to Machine Translation Training [13.39347756245191]
Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer.
We propose an alternative approach that separates the generation of data from the consumption of that data.
In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data (a minimal sketch of such a generator appears after this list).
arXiv Detail & Related papers (2023-08-14T22:47:19Z)
- Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z)
- The Challenges of Continuous Self-Supervised Learning [40.941767578622745]
Self-supervised learning (SSL) aims to eliminate one of the major bottlenecks in representation learning - the need for human annotations.
We show that a direct application of current methods to such a continuous setup is inefficient both computationally and in the amount of data required.
We propose the use of replay buffers as an approach to alleviate the issues of inefficiency and temporal correlations.
arXiv Detail & Related papers (2022-03-23T20:05:06Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Unshuffling Data for Improved Generalization [65.57124325257409]
Generalization beyond the training distribution is a core challenge in machine learning.
We show that partitioning the data into well-chosen, non-i.i.d. subsets treated as multiple training environments can guide the learning of models with better out-of-distribution generalization.
arXiv Detail & Related papers (2020-02-27T03:07:41Z)
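SOTASTREAM's central idea, as summarized in its entry above, is to drop the one-off data-preparation step and instead have data generation emit an endless stream of permutations of the raw training data that the trainer consumes directly. The following is a minimal, generic sketch of that pattern in plain Python; the function names and the batching helper are illustrative assumptions, not the toolkit's API.
```python
# Rough sketch of an "infinite stream of permutations": a generator keeps
# yielding the raw training examples in fresh random orders (with any
# on-the-fly transforms applied per example), and the trainer simply
# consumes batches from it instead of reading pre-built tensor files.
import random
from itertools import islice
from typing import Iterable, Iterator, List


def infinite_permutation_stream(raw_examples: List[str], seed: int = 0) -> Iterator[str]:
    """Yield the raw training data forever, reshuffled on every pass."""
    rng = random.Random(seed)
    while True:
        order = list(range(len(raw_examples)))
        rng.shuffle(order)
        for i in order:
            # on-the-fly transformation would happen here (tokenization,
            # subword sampling, noising, ...); we just yield the raw line
            yield raw_examples[i]


def batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group the stream into trainer-ready batches."""
    it = iter(stream)
    while True:
        yield list(islice(it, batch_size))


# Toy usage: the trainer pulls a few batches directly from the stream.
corpus = [f"sentence {i}" for i in range(10)]
stream = infinite_permutation_stream(corpus, seed=42)
for step, batch in enumerate(batches(stream, batch_size=4)):
    if step >= 3:
        break
    print(step, batch)
```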
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.