DatasetEquity: Are All Samples Created Equal? In The Quest For Equity
Within Datasets
- URL: http://arxiv.org/abs/2308.09878v2
- Date: Tue, 22 Aug 2023 02:32:01 GMT
- Authors: Shubham Shrivastava, Xianling Zhang, Sushruth Nagesh, Armin Parchami
- Abstract summary: This paper presents a novel method for addressing data imbalance in machine learning.
It computes sample likelihoods based on image appearance using deep perceptual embeddings and clustering.
It then uses these likelihoods to weigh samples differently during training with a proposed $\textbf{Generalized Focal Loss}$ function.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data imbalance is a well-known issue in the field of machine learning,
attributable to the cost of data collection, the difficulty of labeling, and
the geographical distribution of the data. In computer vision, bias in data
distribution caused by image appearance remains highly unexplored. Compared to
categorical distributions using class labels, image appearance reveals complex
relationships between objects beyond what class labels provide. Clustering deep
perceptual features extracted from raw pixels gives a richer representation of
the data. This paper presents a novel method for addressing data imbalance in
machine learning. The method computes sample likelihoods based on image
appearance using deep perceptual embeddings and clustering. It then uses these
likelihoods to weigh samples differently during training with a proposed
$\textbf{Generalized Focal Loss}$ function. This loss can be easily integrated
with deep learning algorithms. Experiments validate the method's effectiveness
across autonomous driving vision datasets including KITTI and nuScenes. The
loss function improves state-of-the-art 3D object detection methods, achieving
over $200\%$ AP gains on under-represented classes (Cyclist) in the KITTI
dataset. The results demonstrate the method is generalizable, complements
existing techniques, and is particularly beneficial for smaller datasets and
rare classes. Code is available at:
https://github.com/towardsautonomy/DatasetEquity
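The two-step pipeline described in the abstract — estimate per-sample likelihoods by clustering deep perceptual embeddings, then down-weight common samples during training — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the farthest-point k-means, the cluster-frequency likelihood estimate, and the $(1 - p)^\gamma$ weighting form are assumptions made here for demonstration; see the linked repository for the actual Generalized Focal Loss.

```python
import numpy as np

def sample_likelihoods(embeddings, n_clusters=2, n_iters=20):
    """Estimate each sample's appearance likelihood by clustering its
    perceptual embedding (tiny k-means with farthest-point init) and
    using the frequency of its cluster as the likelihood."""
    idx = [0]
    for _ in range(n_clusters - 1):  # farthest-point seeding
        d = np.min(np.linalg.norm(
            embeddings[:, None] - embeddings[idx][None], axis=-1), axis=1)
        idx.append(int(d.argmax()))
    centers = embeddings[idx].copy()
    for _ in range(n_iters):  # Lloyd iterations
        dist = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        assign = dist.argmin(axis=1)
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = embeddings[assign == k].mean(axis=0)
    freq = np.bincount(assign, minlength=n_clusters) / len(embeddings)
    return freq[assign]  # likelihood of each sample's appearance cluster

def generalized_focal_weight(likelihood, gamma=2.0):
    """Illustrative focal-style weight: common samples (high likelihood)
    are down-weighted, rare samples are up-weighted."""
    return (1.0 - likelihood) ** gamma

# toy demo: 8 "common-looking" samples near the origin, 2 rare ones
emb = np.vstack([np.zeros((8, 2)), np.full((2, 2), 10.0)])
weights = generalized_focal_weight(sample_likelihoods(emb, n_clusters=2))
# weights[-1] (rare cluster) > weights[0] (common cluster)
```

In training, these weights would multiply the per-sample loss, so under-represented appearance modes (e.g. the Cyclist class mentioned above) contribute more gradient signal.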
Related papers
- Outlier Detection in Large Radiological Datasets using UMAP [1.206248959194646]
In biomedical data, variations in image quality, labeling, reports, and archiving can lead to errors, inconsistencies, and repeated samples.
Here, we show that the uniform manifold approximation and projection algorithm can find these anomalies essentially by forming independent clusters.
While the results are archival and retrospective, the graph-based methods work for any data type and will prove equally beneficial for curation at the time of dataset creation.
arXiv Detail & Related papers (2024-07-31T00:56:06Z) - Comparing Importance Sampling Based Methods for Mitigating the Effect of
Class Imbalance [0.0]
We compare three techniques that derive from importance sampling: loss reweighting, undersampling, and oversampling.
We find that up-weighting the loss and undersampling have a negligible effect on performance for underrepresented classes.
Our findings also indicate that there may exist some redundancy in data in the Planet dataset.
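The three importance-sampling-derived techniques compared in this entry can be illustrated on a toy imbalanced label set. The inverse-frequency weighting and the resampling rules below are the standard textbook forms, assumed here only for illustration; they are not taken from the paper itself.

```python
import numpy as np

def inverse_freq_weights(labels):
    """Loss reweighting: weight each sample by 1 / frequency of its class."""
    classes, counts = np.unique(labels, return_counts=True)
    lut = dict(zip(classes, counts.sum() / counts))
    return np.array([lut[y] for y in labels])

def undersample(labels, seed=0):
    """Undersampling: keep min-class-count samples from every class."""
    rng = np.random.default_rng(seed)
    m = np.min(np.bincount(labels))
    keep = np.concatenate([
        rng.choice(np.where(labels == c)[0], m, replace=False)
        for c in np.unique(labels)])
    return np.sort(keep)

def oversample(labels, seed=0):
    """Oversampling: draw max-class-count samples per class, with replacement."""
    rng = np.random.default_rng(seed)
    m = np.max(np.bincount(labels))
    idx = np.concatenate([
        rng.choice(np.where(labels == c)[0], m, replace=True)
        for c in np.unique(labels)])
    return np.sort(idx)

# toy demo: 9 majority-class samples, 1 minority-class sample
labels = np.array([0] * 9 + [1])
weights = inverse_freq_weights(labels)   # minority sample gets weight 10.0
balanced = labels[undersample(labels)]   # 1 sample per class
```

All three produce the same expected gradient under importance sampling, which is why empirical comparisons like the one above focus on their variance and data-efficiency trade-offs.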
arXiv Detail & Related papers (2024-02-28T22:52:27Z) - Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV)
NPC-LV is a learning framework for any dataset with abundant unlabeled data but very few labeled ones.
We show that NPC-LV outperforms supervised methods on all three datasets for image classification in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z) - Improving Contrastive Learning on Imbalanced Seed Data via Open-World
Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK)
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
arXiv Detail & Related papers (2021-11-01T15:09:41Z) - CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z) - Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z) - Silhouettes and quasi residual plots for neural nets and tree-based
classifiers [0.0]
Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data.
An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class.
The graphical displays are illustrated and interpreted on benchmark data sets containing images, mixed features, and tweets.
arXiv Detail & Related papers (2021-06-16T14:26:31Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z) - MAGNeto: An Efficient Deep Learning Method for the Extractive Tags
Summarization Problem [0.0]
We study a new image annotation task named Extractive Tags Summarization (ETS)
The goal is to extract important tags from the context lying in an image and its corresponding tags.
Our proposed solution consists of different widely used blocks like convolutional and self-attention layers.
arXiv Detail & Related papers (2020-11-09T11:34:21Z) - Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.