Provable Stochastic Optimization for Global Contrastive Learning: Small
Batch Does Not Harm Performance
- URL: http://arxiv.org/abs/2202.12387v1
- Date: Thu, 24 Feb 2022 22:16:53 GMT
- Title: Provable Stochastic Optimization for Global Contrastive Learning: Small
Batch Does Not Harm Performance
- Authors: Zhuoning Yuan, Yuexin Wu, Zihao Qiu, Xianzhi Du, Lijun Zhang, Denny
Zhou, Tianbao Yang
- Abstract summary: We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point.
Existing methods such as SimCLR require a large batch size in order to achieve a satisfactory result.
We propose a memory-efficient optimization algorithm, named SogCLR, for solving the global objective of Contrastive Learning of Representations.
- Score: 53.49803579981569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study contrastive learning from an optimization
perspective, aiming to analyze and address a fundamental issue of existing
contrastive learning methods that either rely on a large batch size or a large
dictionary. We consider a global objective for contrastive learning, which
contrasts each positive pair with all negative pairs for an anchor point. From
the optimization perspective, we explain why existing methods such as SimCLR
require a large batch size in order to achieve a satisfactory result. In order
to remove such a requirement, we propose a memory-efficient Stochastic
Optimization algorithm for solving the Global objective of Contrastive Learning
of Representations, named SogCLR. We show that its optimization error is
negligible under a reasonable condition after a sufficient number of iterations
or is diminishing for a slightly different global contrastive objective.
Empirically, we demonstrate that on ImageNet with a batch size of 256, SogCLR
achieves 69.4% top-1 linear evaluation accuracy using ResNet-50, which is on par
with SimCLR (69.3%) trained with a large batch size of 8,192.
We also attempt to show that the proposed optimization technique is generic and
can be applied to solving other contrastive losses, e.g., two-way contrastive
losses for bimodal contrastive learning.
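As a concrete illustration of the global objective, the sketch below (plain NumPy, not the authors' code) computes a small-batch estimate of a global contrastive loss while keeping a per-anchor moving-average estimate of the denominator across iterations, so the effective set of negatives is not limited to the current batch. The variable names, the EMA decay `gamma`, and the use of a mean over negatives are illustrative assumptions, in the spirit of but not identical to SogCLR.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau, gamma = 1000, 128, 0.1, 0.9   # dataset size, embedding dim, temperature, EMA decay

# Per-anchor running estimate of the (mean) exponentiated negative similarity.
u = np.ones(n)

def global_contrastive_step(idx, z_anchor, z_pos, z_neg):
    """Small-batch estimate of a global contrastive loss for the anchors in `idx`.

    z_anchor : (B, d) anchor embeddings, L2-normalized
    z_pos    : (B, d) positive embeddings, L2-normalized
    z_neg    : (M, d) embeddings serving as negatives for every anchor in this batch
    """
    s_pos = np.sum(z_anchor * z_pos, axis=1)                # (B,) positive similarities
    s_neg = z_anchor @ z_neg.T                              # (B, M) negative similarities
    batch_est = np.exp(s_neg / tau).mean(axis=1)            # noisy estimate of the global mean
    u[idx] = gamma * u[idx] + (1.0 - gamma) * batch_est     # EMA remembers past batches
    # -log( exp(s_pos/tau) / denominator ), with the denominator replaced by the running estimate
    return np.mean(-s_pos / tau + np.log(u[idx]))

# Toy usage with random unit vectors (8 anchors, 32 negatives).
za = rng.normal(size=(8, d)); za /= np.linalg.norm(za, axis=1, keepdims=True)
zp = rng.normal(size=(8, d)); zp /= np.linalg.norm(zp, axis=1, keepdims=True)
zn = rng.normal(size=(32, d)); zn /= np.linalg.norm(zn, axis=1, keepdims=True)
print(global_contrastive_step(np.arange(8), za, zp, zn))
```

The point of the running estimate is that a small batch only provides a noisy sample of the global denominator; averaging those samples over iterations is what lets the optimization error stay small without ever forming a large batch.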
Related papers
- Jacobian Descent for Multi-Objective Optimization [0.6138671548064355]
Gradient descent is limited to single-objective optimization.
Jacobian descent (JD) iteratively updates parameters using the Jacobian matrix of a vector-valued objective function.
arXiv Detail & Related papers (2024-06-23T22:06:25Z)
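For the Jacobian descent entry above, a minimal sketch of the idea follows: each row of the Jacobian is the gradient of one objective, and the rows are aggregated into a single update direction. The aggregator used here (averaging normalized rows) is purely illustrative; the paper studies specific aggregators with their own guarantees.

```python
import numpy as np

def jacobian_descent_step(params, objective_grads, lr=0.1):
    """One illustrative Jacobian-descent-style update.

    params          : (d,) current parameter vector
    objective_grads : list of callables, each returning the gradient of one objective at `params`
    """
    # Stack per-objective gradients into the Jacobian of the vector-valued objective.
    jac = np.stack([g(params) for g in objective_grads])               # (m, d)
    # Aggregate rows into one direction; averaging normalized rows is a simple illustrative choice.
    rows = jac / (np.linalg.norm(jac, axis=1, keepdims=True) + 1e-12)
    direction = rows.mean(axis=0)
    return params - lr * direction

# Toy bi-objective example: f1(x) = ||x - a||^2, f2(x) = ||x - b||^2.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
grads = [lambda x: 2 * (x - a), lambda x: 2 * (x - b)]
x = np.zeros(2)
for _ in range(100):
    x = jacobian_descent_step(x, grads)
print(x)   # converges to a compromise between the two minimizers
```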
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Mini-Batch Optimization of Contrastive Loss [13.730030395850358]
We show that mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ mini-batches are selected.
We then demonstrate that utilizing high-loss mini-batches can speed up SGD convergence and propose a spectral clustering-based approach for identifying these high-loss mini-batches.
arXiv Detail & Related papers (2023-07-12T04:23:26Z)
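The "all $\binom{N}{B}$ mini-batches" quantity in the entry above can be made concrete on a tiny example: the sketch below enumerates every size-$B$ mini-batch of a toy dataset and averages a per-batch contrastive loss, which is the quantity compared against the full-batch loss. The loss here is a generic InfoNCE-style loss; the exact equivalence conditions are those stated in the paper, not verified by this toy code.

```python
import numpy as np
from itertools import combinations

def infonce(z1, z2, tau=0.1):
    """Generic InfoNCE-style loss over aligned pairs (z1[i], z2[i])."""
    sim = z1 @ z2.T / tau                                            # (n, n) similarities
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
N, B, d = 6, 3, 4
z1 = rng.normal(size=(N, d)); z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
z2 = rng.normal(size=(N, d)); z2 /= np.linalg.norm(z2, axis=1, keepdims=True)

full_batch = infonce(z1, z2)
# Average the loss over every one of the C(N, B) possible mini-batches.
mini_batches = list(combinations(range(N), B))
avg_mini = np.mean([infonce(z1[list(s)], z2[list(s)]) for s in mini_batches])
print(len(mini_batches), full_batch, avg_mini)   # 20 mini-batches; the two values generally differ
```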
- Supervised Contrastive Learning as Multi-Objective Optimization for Fine-Tuning Large Pre-trained Language Models [3.759936323189417]
Supervised Contrastive Learning (SCL) has been shown to achieve excellent performance in most classification tasks.
In this work, we formulate the SCL problem as a Multi-Objective Optimization problem for the fine-tuning phase of the RoBERTa language model.
arXiv Detail & Related papers (2022-09-28T15:13:58Z)
- Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on a recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of ranked-range losses, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
- Max-Margin Contrastive Learning [120.32963353348674]
We present max-margin contrastive learning (MMCL) for unsupervised representation learning.
Our approach selects negatives as the sparse support vectors obtained via a quadratic optimization problem.
We validate our approach on standard vision benchmark datasets, demonstrating better performance in unsupervised representation learning.
arXiv Detail & Related papers (2021-12-21T18:56:54Z)
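For the max-margin contrastive learning entry above, the sketch below illustrates the support-vector view of negative selection, assuming scikit-learn is available: a linear SVM separates an anchor's positive from its candidate negatives, and only the negatives that end up as support vectors are kept for the contrastive loss. The one-positive-versus-negatives setup and all names are illustrative simplifications; MMCL solves a quadratic program over the batch rather than calling an off-the-shelf classifier.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d, n_neg = 16, 64
positive = rng.normal(size=d)
negatives = rng.normal(size=(n_neg, d))

# Separate the positive from the candidate negatives with a linear max-margin classifier.
X = np.vstack([positive[None, :], negatives])
y = np.array([1] + [0] * n_neg)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Keep only the negatives that are support vectors (the "hard" negatives near the margin).
support_negatives = [i - 1 for i in svm.support_ if y[i] == 0]
print(len(support_negatives), "of", n_neg, "negatives selected")
```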
- MIO: Mutual Information Optimization using Self-Supervised Binary Contrastive Learning [19.5917119072985]
We model contrastive learning as a binary classification problem: predicting whether a pair is positive or not.
The proposed method outperforms state-of-the-art algorithms on benchmark datasets such as STL-10, CIFAR-10, and CIFAR-100.
arXiv Detail & Related papers (2021-11-24T17:51:29Z)
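The binary-classification framing in the MIO entry above can be sketched directly: treat the temperature-scaled similarity of a pair as a logit and train it with binary cross-entropy against a positive/negative label. This is only a schematic of that framing; the paper's actual loss and mutual-information argument are more involved.

```python
import numpy as np

def pairwise_bce(z1, z2, label, tau=0.1):
    """Binary cross-entropy on a single pair, treating similarity / tau as the logit.

    z1, z2 : L2-normalized embeddings of the two views
    label  : 1.0 if the pair is positive, 0.0 if negative
    """
    logit = np.dot(z1, z2) / tau
    prob = 1.0 / (1.0 + np.exp(-logit))           # predicted probability that the pair is positive
    eps = 1e-12
    return -(label * np.log(prob + eps) + (1 - label) * np.log(1 - prob + eps))

rng = np.random.default_rng(0)
a = rng.normal(size=8); a /= np.linalg.norm(a)
b = rng.normal(size=8); b /= np.linalg.norm(b)
print(pairwise_bce(a, a, 1.0), pairwise_bce(a, b, 0.0))
```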
- Decoupled Contrastive Learning [23.25775900388382]
We identify a noticeable negative-positive-coupling (NPC) effect in the widely used cross-entropy (InfoNCE) loss.
By properly addressing the NPC effect, we reach a decoupled contrastive learning (DCL) objective function.
Our approach achieves 66.9% ImageNet top-1 accuracy using batch size 256 within 200 epochs of pre-training, outperforming its baseline SimCLR by 5.1%.
arXiv Detail & Related papers (2021-10-13T16:38:43Z)
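For the decoupled contrastive learning entry above, the NPC effect is easiest to see in the denominator of the per-anchor InfoNCE term, which contains the positive pair itself; one common reading of the decoupling is to drop that positive term from the denominator. The sketch below contrasts the two for a single anchor and is an illustration, not the authors' exact objective.

```python
import numpy as np

def per_anchor_losses(z_anchor, z_pos, z_neg, tau=0.1):
    """Standard vs. decoupled contrastive loss for a single anchor.

    z_anchor, z_pos : L2-normalized embeddings of the anchor and its positive
    z_neg           : (K, d) L2-normalized negative embeddings
    """
    pos = np.exp(np.dot(z_anchor, z_pos) / tau)
    negs = np.exp(z_neg @ z_anchor / tau).sum()
    coupled = -np.log(pos / (pos + negs))     # InfoNCE: the positive also appears in the denominator
    decoupled = -np.log(pos / negs)           # decoupled: the denominator contains negatives only
    return coupled, decoupled

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 8)); z /= np.linalg.norm(z, axis=1, keepdims=True)
print(per_anchor_losses(z[0], z[1], z[2:]))
```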
- EqCo: Equivalent Rules for Self-supervised Contrastive Learning [81.45848885547754]
We propose a method to make self-supervised learning insensitive to the number of negative samples in InfoNCE-based contrastive learning frameworks.
Inspired by the InfoMax principle, we point out that the margin term in the contrastive loss needs to be adaptively scaled according to the number of negative pairs.
arXiv Detail & Related papers (2020-10-05T11:39:04Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.