Slice Tuner: A Selective Data Acquisition Framework for Accurate and
Fair Machine Learning Models
- URL: http://arxiv.org/abs/2003.04549v3
- Date: Sat, 21 Aug 2021 12:19:45 GMT
- Title: Slice Tuner: A Selective Data Acquisition Framework for Accurate and
Fair Machine Learning Models
- Authors: Ki Hyun Tae, Steven Euijong Whang
- Abstract summary: We propose Slice Tuner to selectively acquire data to ensure model accuracy and fairness.
At its core, Slice Tuner maintains learning curves of slices that estimate the model accuracies given more data.
We show that Slice Tuner significantly outperforms baselines in terms of model accuracy and fairness.
- Score: 10.501265073049447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As machine learning becomes democratized in the era of Software 2.0, a
serious bottleneck is acquiring enough data to ensure accurate and fair models.
Recent techniques including crowdsourcing provide cost-effective ways to gather
such data. However, simply acquiring as much data as possible is not
necessarily an effective strategy for optimizing accuracy and fairness. For
example, if an online app store has enough training data for certain slices of
data (say American customers), but not for others, obtaining more American
customer data will only bias the model training. Instead, we contend that one
needs to selectively acquire data and propose Slice Tuner, which acquires
possibly-different amounts of data per slice such that the model accuracy and
fairness on all slices are optimized. This problem is different from labeling
existing data (as in active learning or weak supervision) because the goal is
obtaining the right amounts of new data. At its core, Slice Tuner maintains
learning curves of slices that estimate the model accuracies given more data
and uses convex optimization to find the best data acquisition strategy. The
key challenges of estimating learning curves are that they may be inaccurate if
there is not enough data, and there may be dependencies among slices where
acquiring data for one slice influences the learning curves of others. We solve
these issues by iteratively and efficiently updating the learning curves as
more data is acquired. We evaluate Slice Tuner on real datasets using
crowdsourcing for data acquisition and show that Slice Tuner significantly
outperforms baselines in terms of model accuracy and fairness, even when the
learning curves cannot be reliably estimated.
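To make the acquisition strategy concrete, below is a minimal, hypothetical sketch rather than the authors' implementation: it assumes per-slice learning curves of the power-law form loss(n) ~ a * n^(-b), fits them from a few (training size, validation loss) observations, and then splits a fixed acquisition budget across slices by minimizing the total estimated loss. Slice Tuner's actual objective also penalizes unfairness (loss disparity across slices) and iteratively re-fits the curves as new data is acquired.

```python
# Hypothetical sketch of per-slice learning curves + budget allocation.
import numpy as np
from scipy.optimize import curve_fit, minimize

def power_law(n, a, b):
    # Estimated validation loss of a slice after training on n examples.
    return a * np.power(n, -b)

def fit_learning_curve(sizes, losses):
    # Fit (a, b) from observed (training size, validation loss) pairs.
    (a, b), _ = curve_fit(power_law,
                          np.asarray(sizes, dtype=float),
                          np.asarray(losses, dtype=float),
                          p0=(1.0, 0.5), maxfev=10000)
    return a, b

def allocate_budget(curves, current_sizes, budget):
    # Choose how many new examples to acquire per slice under a total budget,
    # minimizing the summed estimated loss over all slices.
    k = len(curves)

    def total_loss(x):
        return sum(a * (n + xi) ** (-b)
                   for (a, b), n, xi in zip(curves, current_sizes, x))

    result = minimize(total_loss,
                      x0=np.full(k, budget / k),
                      bounds=[(0.0, budget)] * k,
                      constraints=[{"type": "eq",
                                    "fun": lambda x: np.sum(x) - budget}],
                      method="SLSQP")
    return np.round(result.x).astype(int)

# Toy usage: the data-poor slice receives most of the acquisition budget.
curves = [fit_learning_curve([100, 200, 400], [0.40, 0.31, 0.24]),      # under-represented slice
          fit_learning_curve([1000, 2000, 4000], [0.12, 0.10, 0.08])]   # well-covered slice
print(allocate_budget(curves, current_sizes=[400, 4000], budget=1000))
```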
Related papers
- Data Acquisition for Improving Model Fairness using Reinforcement Learning [3.3916160303055563]
We focus on the task of acquiring additional labeled data points for training the downstream machine learning model to rapidly improve its fairness.
We present DataSift, a data acquisition framework based on the idea of data valuation that relies on partitioning and multi-armed bandits to determine the most valuable data points to acquire (a generic bandit sketch of this acquisition loop is given after this list).
We empirically evaluate DataSift on several real-world and synthetic datasets and show that the fairness of a machine learning model can be significantly improved even while acquiring a few data points.
arXiv Detail & Related papers (2024-12-04T03:56:54Z)
- Compute-Constrained Data Selection [77.06528009072967]
We find that many powerful data selection methods are almost never compute-optimal.
For compute-optimal training, we find that perplexity and gradient data selection require training-to-selection model size ratios of 5x and 10x, respectively.
arXiv Detail & Related papers (2024-10-21T17:11:21Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrades training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning [0.0]
We propose the CHG (compound of Hardness and Gradient) utility function, which approximates the utility of each data subset on model performance in every training epoch.
By deriving the closed-form Shapley value for each data point using the CHG utility function, we reduce the computational complexity to that of a single model retraining.
We further leverage CHG Shapley for real-time data selection, conducting experiments across three settings: standard datasets, label noise datasets, and class imbalance datasets.
arXiv Detail & Related papers (2024-06-17T16:48:31Z)
- Certain and Approximately Certain Models for Statistical Learning [4.318959672085627]
We show that it is possible to learn accurate models directly from data with missing values for certain training data and target models.
We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary.
arXiv Detail & Related papers (2024-02-27T22:49:33Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- CLIP: Train Faster with Less Data [3.2575001434344286]
Deep learning models require an enormous amount of data for training.
Recently, there has been a shift in machine learning from model-centric to data-centric approaches.
We propose CLIP, i.e., Curriculum Learning with Iterative data Pruning.
arXiv Detail & Related papers (2022-12-02T21:29:48Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
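The DataSift entry above mentions multi-armed bandits over data partitions. The following is a generic UCB1-style acquisition loop sketched only to illustrate that idea; the partition names and reward callback are hypothetical, and this is not the paper's actual algorithm.

```python
# Generic UCB1 bandit over data partitions (illustrative only).
import math
import random

def ucb_acquire(partitions, evaluate_gain, rounds=60, batch_size=20):
    # Repeatedly acquire a batch from the partition whose acquisitions
    # currently look most valuable, balancing exploration and exploitation.
    counts = {p: 0 for p in partitions}
    mean_reward = {p: 0.0 for p in partitions}

    for t in range(1, rounds + 1):
        untried = [p for p in partitions if counts[p] == 0]
        if untried:
            choice = untried[0]  # play every arm once before scoring with UCB
        else:
            choice = max(partitions,
                         key=lambda p: mean_reward[p]
                         + math.sqrt(2.0 * math.log(t) / counts[p]))
        reward = evaluate_gain(choice, batch_size)  # e.g. accuracy/fairness gain
        counts[choice] += 1
        mean_reward[choice] += (reward - mean_reward[choice]) / counts[choice]
    return counts  # number of batches acquired from each partition

# Toy usage with a hypothetical noisy gain signal per partition.
true_gain = {"slice_A": 0.05, "slice_B": 0.02, "slice_C": 0.01}
noisy_gain = lambda p, _batch: random.gauss(true_gain[p], 0.01)
print(ucb_acquire(list(true_gain), noisy_gain))
```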