Tradeoffs in Resampling and Filtering for Imbalanced Classification
- URL: http://arxiv.org/abs/2209.00127v1
- Date: Wed, 31 Aug 2022 21:40:47 GMT
- Title: Tradeoffs in Resampling and Filtering for Imbalanced Classification
- Authors: Ryan Muther, David Smith
- Abstract summary: We show that different methods of selecting training data bring tradeoffs in effectiveness and efficiency.
We also see that in highly imbalanced cases, filtering test data using first-pass retrieval models is as important for model performance as selecting training data.
- Score: 2.3605348648054454
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Imbalanced classification problems are extremely common in natural language
processing and are solved using a variety of resampling and filtering
techniques, which often involve deciding how to select training data or which
test examples should be labeled by the model. We examine the tradeoffs in model
performance involved in choosing the training sample and in filtering the
training and test data in a heavily imbalanced token classification task, and we
examine the relationship between the magnitude of these tradeoffs and the
base rate of the phenomenon of interest. In experiments on sequence tagging to
detect rare phenomena in English and Arabic texts, we find that different
methods of selecting training data bring tradeoffs in effectiveness and
efficiency. We also see that in highly imbalanced cases, filtering test data
using first-pass retrieval models is as important for model performance as
selecting training data. The base rate of a rare positive class has a clear
effect on the magnitude of the changes in performance caused by the selection
of training or test data. As the base rate increases, the differences brought
about by those choices decrease.
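The two choices the abstract studies can be illustrated with a minimal sketch: undersampling negative training examples to a fixed ratio, and using a cheap first-pass scorer to restrict which test examples the classifier labels. All function and variable names here are illustrative, not from the paper.

```python
import random

def undersample_negatives(examples, ratio, seed=0):
    """Keep all positives and at most ratio * n_pos randomly sampled negatives."""
    rng = random.Random(seed)
    pos = [x for x in examples if x["label"] == 1]
    neg = [x for x in examples if x["label"] == 0]
    keep = min(len(neg), int(ratio * len(pos)))
    return pos + rng.sample(neg, keep)

def filter_test_set(examples, first_pass_score, threshold):
    """First-pass filter: only examples scoring above threshold reach the classifier."""
    return [x for x in examples if first_pass_score(x) >= threshold]

# Toy data with a rare positive class (base rate 5%).
data = [{"text": f"ex{i}", "label": 1 if i % 20 == 0 else 0} for i in range(200)]

# After undersampling, the training class ratio is fixed at 1:3
# regardless of the base rate in the raw data.
train = undersample_negatives(data, ratio=3.0)
```

Under this sketch, the rarer the positive class, the more of the raw data the undersampling step discards, and the more the first-pass filter determines which examples the model ever sees at test time.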
Related papers
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift between training and test data by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
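The two balancing strategies named above can be sketched in a few lines: random oversampling duplicates minority-class examples, while random undersampling discards majority-class examples, until the classes are the same size. This is a hedged illustration; function names are our own, not from the paper.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for xi in items + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Discard majority-class examples until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = min(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        for xi in rng.sample(items, target):
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out
```

Oversampling keeps all available information at the cost of duplicated minority examples (and a risk of overfitting them); undersampling trains faster but throws away majority-class data.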
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Outlier Detection as Instance Selection Method for Feature Selection in Time Series Classification [0.0]
Outlier detection is used to filter the instances provided to feature selection methods, selecting rare instances.
For some data sets, the resulting increase in performance was only a few percent.
For other datasets, we were able to achieve increases in performance of up to 16 percent.
arXiv Detail & Related papers (2021-11-16T14:44:33Z)
- An Empirical Study on the Joint Impact of Feature Selection and Data Resampling on Imbalance Classification [4.506770920842088]
This study focuses on the synergy between feature selection and data resampling for imbalance classification.
We conduct a large number of experiments on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms.
arXiv Detail & Related papers (2021-09-01T06:01:51Z)
- Unsupervised neural adaptation model based on optimal transport for spoken language identification [54.96267179988487]
Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded.
We propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID.
arXiv Detail & Related papers (2020-12-24T07:37:19Z)
- Message Passing Adaptive Resonance Theory for Online Active Semi-supervised Learning [30.19936050747407]
We propose Message Passing Adaptive Resonance Theory (MPART) for online active semi-supervised learning.
MPART infers the class of unlabeled data and selects informative and representative samples through message passing between nodes on the topological graph.
We evaluate our model with comparable query selection strategies and frequencies, showing that MPART significantly outperforms the competitive models in online active learning environments.
arXiv Detail & Related papers (2020-12-02T14:14:42Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.