Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
- URL: http://arxiv.org/abs/2410.04579v5
- Date: Sun, 09 Mar 2025 23:07:33 GMT
- Title: Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets
- Authors: Tianjian Li, Haoran Xu, Weiting Tan, Kenton Murray, Daniel Khashabi
- Abstract summary: Two common strategies to address this disparity are upsampling low-resource data and upweighting low-resource loss. We identify when these two methods are equivalent and when they diverge. We propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence.
- Score: 31.5733738608772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data abundance across different domains exhibits a long-tailed distribution: few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation compared to Scalarization, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting -- achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.
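To make the two strategies concrete, the sketch below (a minimal illustration under assumed conventions, not the authors' released code) computes Temperature Sampling proportions p_i ∝ q_i^(1/τ), the Scalarization loss weights w_i ∝ p_i/q_i that match them in expectation under full-batch gradients, and a simple linear Cooldown-style schedule that anneals τ from heavy upsampling toward proportional sampling; the corpus sizes and schedule endpoints are illustrative.

```python
# A minimal sketch under assumed conventions (not the authors' released code):
# Temperature Sampling draws language i with p_i ∝ q_i^(1/tau); Scalarization
# instead weights language i's loss by w_i ∝ p_i / q_i, which matches the
# expected full-batch gradient; a Cooldown-style schedule anneals tau.
import numpy as np

def temperature_probs(counts, tau):
    """Sampling proportions p_i ∝ q_i^(1/tau); tau = 1 is proportional
    sampling, larger tau upsamples low-resource languages."""
    q = np.asarray(counts, dtype=float) / np.sum(counts)
    p = q ** (1.0 / tau)
    return p / p.sum()

def scalarization_weights(counts, tau):
    """Loss weights w_i ∝ p_i / q_i that reproduce the same expected
    full-batch gradient while sampling data proportionally."""
    q = np.asarray(counts, dtype=float) / np.sum(counts)
    w = temperature_probs(counts, tau) / q
    return w / w.sum()

def cooldown_tau(step, total_steps, tau_start=5.0, tau_end=1.0):
    """Linearly anneal tau: fast early convergence, less overfitting later."""
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)

sizes = [1_000_000, 50_000, 5_000]   # one high-, two low-resource corpora
for step in (0, 50_000, 100_000):
    tau = cooldown_tau(step, 100_000)
    print(f"tau={tau:.1f}", temperature_probs(sizes, tau).round(3),
          scalarization_weights(sizes, tau).round(3))
```

As τ cools from 5 toward 1, the low-resource shares shrink back toward their natural proportions, mirroring the gradual reduction of upsampling described in the abstract.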
Related papers
- Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers [49.97755400231656]
We present the first performance guarantee with explicit dimensional dependence for general score-mismatched diffusion samplers.
We show that score mismatches result in a distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions.
This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise.
arXiv Detail & Related papers (2024-10-17T16:42:12Z)
- Capturing Climatic Variability: Using Deep Learning for Stochastic Downscaling [0.0]
Adapting to the changing climate requires accurate local climate information.
Capturing variability while downscaling is crucial for estimating uncertainty and characterising extreme events.
We propose approaches to improve the calibration of GANs in three ways.
arXiv Detail & Related papers (2024-05-31T03:04:10Z)
- Fairness under Covariate Shift: Improving Fairness-Accuracy tradeoff with few Unlabeled Test Samples [21.144077993862652]
We operate in the unsupervised regime where only a small set of unlabeled test samples along with a labeled training set is available.
We experimentally verify that optimizing with our loss formulation significantly outperforms a number of state-of-the-art baselines.
arXiv Detail & Related papers (2023-10-11T14:39:51Z)
- Robustness May be More Brittle than We Think under Different Degrees of Distribution Shifts [72.90906474654594]
We show that robustness of models can be quite brittle and inconsistent under different degrees of distribution shifts.
We observe that large-scale pre-trained models, such as CLIP, are sensitive to even minute distribution shifts of novel downstream tasks.
arXiv Detail & Related papers (2023-10-10T13:39:18Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- When are ensembles really effective? [49.37269057899679]
We study the question of when ensembling yields significant performance improvements in classification tasks.
We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate.
We identify practical scenarios where ensembling does and does not result in large performance improvements.
arXiv Detail & Related papers (2023-05-21T01:36:25Z)
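As a hedged illustration of the two quantities the ensembles summary above compares (not the paper's code), the snippet below measures the members' average error rate and their average pairwise disagreement rate; the synthetic predictions are assumptions for the demo.

```python
# Illustrative only: average member error rate vs. average pairwise
# disagreement rate of an ensemble, the quantities the summary relates.
from itertools import combinations
import numpy as np

def avg_error_rate(preds, y):
    """Mean per-model error over predictions of shape (n_models, n_samples)."""
    return np.mean([np.mean(p != y) for p in preds])

def disagreement_rate(preds):
    """Mean fraction of samples on which a pair of members disagrees."""
    pairs = combinations(range(len(preds)), 2)
    return np.mean([np.mean(preds[i] != preds[j]) for i, j in pairs])

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
# Three noisy copies of the labels stand in for member predictions.
preds = np.array([np.where(rng.random(1000) < 0.2, 1 - y, y) for _ in range(3)])
print(avg_error_rate(preds, y), disagreement_rate(preds))
```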
- PA&DA: Jointly Sampling PAth and DAta for Consistent NAS [8.737995937682271]
One-shot NAS methods train a supernet and then inherit the pre-trained weights to evaluate sub-models.
Large gradient variance occurs during supernet training, which degrades the supernet ranking consistency.
We propose to explicitly minimize the gradient variance of the supernet training by jointly optimizing the sampling distributions of PAth and DAta.
arXiv Detail & Related papers (2023-02-28T17:14:24Z)
- DensePure: Understanding Diffusion Models towards Adversarial Robustness [110.84015494617528]
We analyze the properties of diffusion models and establish the conditions under which they can enhance certified robustness.
We propose a new method, DensePure, designed to improve the certified robustness of a pretrained model (i.e., a classifier).
We show that this robust region is a union of multiple convex sets, and is potentially much larger than the robust regions identified in previous works.
arXiv Detail & Related papers (2022-11-01T08:18:07Z)
- Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
arXiv Detail & Related papers (2022-08-05T01:23:54Z)
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
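For concreteness, a minimal sketch (assumed, not the paper's code) of the two sampling schemes the summary above compares: with-replacement mini-batches versus Random Reshuffling, which visits every example exactly once per epoch.

```python
# Illustrative contrast between with-replacement sampling and Random
# Reshuffling (sampling without replacement via a per-epoch permutation).
import numpy as np

def with_replacement_batches(n, batch_size, steps, rng):
    for _ in range(steps):
        yield rng.integers(0, n, size=batch_size)   # duplicates allowed

def random_reshuffling_batches(n, batch_size, rng):
    perm = rng.permutation(n)                       # without replacement
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
print([b.tolist() for b in with_replacement_batches(8, 3, steps=3, rng=rng)])
print([b.tolist() for b in random_reshuffling_batches(8, 3, rng)])
```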
- On the Importance of Sampling in Learning Graph Convolutional Networks [13.713485304798368]
Graph Convolutional Networks (GCNs) have achieved impressive empirical advancement across a wide variety of graph-related applications.
Despite their success, training GCNs on large graphs suffers from computational and memory issues.
We describe and analyze a general doubly variance reduction schema that can accelerate any sampling method under the memory budget.
arXiv Detail & Related papers (2021-03-03T21:31:23Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
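The mechanism the summary above describes, per-sample importance weights inside momentum SGD, can be sketched as follows; the softmax-over-losses weighting and the temperature `lam` are assumptions standing in for the paper's exact attentional weighting.

```python
# A hedged sketch: momentum SGD where each mini-batch sample gets its own
# importance weight, here a softmax over per-sample losses (an assumption).
import torch

def weighted_batch_loss(per_sample_losses, lam=1.0):
    """Weight samples by softmax(loss / lam); with lam > 0 this emphasizes
    high-loss (e.g., minority-class) samples."""
    w = torch.softmax(per_sample_losses.detach() / lam, dim=0)
    return (w * per_sample_losses).sum()

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
losses = torch.nn.functional.cross_entropy(model(x), y, reduction="none")
opt.zero_grad()
weighted_batch_loss(losses).backward()
opt.step()
```

Flipping the sign of the temperature would instead downweight high-loss samples, the regime relevant for label noise rather than class imbalance.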
- DEMI: Discriminative Estimator of Mutual Information [5.248805627195347]
Estimating mutual information between continuous random variables is often intractable and challenging for high-dimensional data.
Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information.
Our approach is based on training a classifier that provides the probability that a data sample pair is drawn from the joint distribution.
arXiv Detail & Related papers (2020-10-05T04:19:27Z)
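The classifier-based recipe in the summary above can be sketched with the standard density-ratio trick (an assumed implementation, not necessarily DEMI's exact estimator): with balanced classes, the classifier's logit on a pair estimates log p(x,y)/(p(x)p(y)), so its mean over joint samples estimates the mutual information.

```python
# Assumed density-ratio-trick sketch: a classifier separates joint pairs
# (x, y) from pairs with y shuffled; its mean logit on joint pairs ≈ MI.
import torch

torch.manual_seed(0)
n = 4096
x = torch.randn(n, 1)
y = x + 0.5 * torch.randn(n, 1)   # correlated pair; true MI = 0.5*ln(5) ≈ 0.80

clf = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
bce = torch.nn.BCEWithLogitsLoss()

for _ in range(500):
    joint = torch.cat([x, y], dim=1)
    marginal = torch.cat([x, y[torch.randperm(n)]], dim=1)  # break the pairing
    logits = clf(torch.cat([joint, marginal]))
    labels = torch.cat([torch.ones(n, 1), torch.zeros(n, 1)])
    opt.zero_grad()
    bce(logits, labels).backward()
    opt.step()

print(float(clf(torch.cat([x, y], dim=1)).mean()))  # ≈ MI in nats
```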
- A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)
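A minimal sketch (toy model and data are assumptions) of the kind of measurement the study above performs: the variance of mini-batch gradients around their mean at a fixed parameter point, which one could track over training to observe the reported increase.

```python
# Illustrative mini-batch gradient variance: E ||g_b - E[g]||^2 over a set
# of batches evaluated at the same parameters.
import torch

def gradient_variance(model, loss_fn, batches):
    grads = []
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    g = torch.stack(grads)
    return ((g - g.mean(0)) ** 2).sum(dim=1).mean()

model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(16, 10), torch.randint(0, 2, (16,))) for _ in range(8)]
print(float(gradient_variance(model, loss_fn, batches)))
```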
- Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks [22.618779809748435]
Existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization.
We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance.
We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and better generalization.
arXiv Detail & Related papers (2020-06-24T16:49:29Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
- Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
arXiv Detail & Related papers (2020-05-08T05:09:08Z)
- Minority Class Oversampling for Tabular Data with Deep Generative Models [4.976007156860967]
We study the ability of deep generative models to provide realistic samples that improve performance on imbalanced classification tasks via oversampling.
Our experiments show that the sampling method does not affect quality, but runtime varies widely.
We also observe that the improvements in terms of performance metric, while shown to be significant, often are minor in absolute terms.
arXiv Detail & Related papers (2020-05-07T21:35:57Z)