Tradeoffs in Resampling and Filtering for Imbalanced Classification
- URL: http://arxiv.org/abs/2209.00127v1
- Date: Wed, 31 Aug 2022 21:40:47 GMT
- Title: Tradeoffs in Resampling and Filtering for Imbalanced Classification
- Authors: Ryan Muther, David Smith
- Abstract summary: We show that different methods of selecting training data bring tradeoffs in effectiveness and efficiency.
We also see that in highly imbalanced cases, filtering test data using first-pass retrieval models is as important for model performance as selecting training data.
- Score: 2.3605348648054454
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Imbalanced classification problems are extremely common in natural language
processing and are solved using a variety of resampling and filtering
techniques, which often involve deciding how to select training data or which
test examples should be labeled by the model. We examine the tradeoffs in model
performance involved in choosing the training sample and in filtering the
training and test data in a heavily imbalanced token classification task, and we
examine the relationship between the magnitude of these tradeoffs and the
base rate of the phenomenon of interest. In experiments on sequence tagging to
detect rare phenomena in English and Arabic texts, we find that different
methods of selecting training data bring tradeoffs in effectiveness and
efficiency. We also see that in highly imbalanced cases, filtering test data
using first-pass retrieval models is as important for model performance as
selecting training data. The base rate of a rare positive class has a clear
effect on the magnitude of the changes in performance caused by the selection
of training or test data. As the base rate increases, the differences brought
about by those choices decrease.
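The two choices the abstract studies can be illustrated with a minimal sketch: undersampling negative training examples to a fixed ratio, and using a cheap first-pass scorer to restrict which test examples the classifier labels. All function and variable names here are illustrative, not from the paper.

```python
import random

def undersample_negatives(examples, ratio, seed=0):
    """Keep all positives and at most ratio * n_pos randomly sampled negatives."""
    rng = random.Random(seed)
    pos = [x for x in examples if x["label"] == 1]
    neg = [x for x in examples if x["label"] == 0]
    keep = min(len(neg), int(ratio * len(pos)))
    return pos + rng.sample(neg, keep)

def filter_test_set(examples, first_pass_score, threshold):
    """First-pass filter: only examples scoring above threshold reach the classifier."""
    return [x for x in examples if first_pass_score(x) >= threshold]

# Toy data with a rare positive class (base rate 5%).
data = [{"text": f"ex{i}", "label": 1 if i % 20 == 0 else 0} for i in range(200)]

# After undersampling, the training class ratio is fixed at 1:3
# regardless of the base rate in the raw data.
train = undersample_negatives(data, ratio=3.0)
```

Under this sketch, the rarer the positive class, the more of the raw data the undersampling step discards, and the more the first-pass filter determines which examples the model ever sees at test time.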
Related papers
- ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws [67.59263833387536]
ScalingFilter is a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data.
To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric that uses text embedding models to obtain semantic representations.
arXiv Detail & Related papers (2024-08-15T17:59:30Z)
- The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
arXiv Detail & Related papers (2024-02-14T03:43:05Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift between training and test data by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
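The two balancing strategies named above can be sketched in a few lines: random oversampling duplicates minority-class examples, while random undersampling discards majority-class examples, until the classes are the same size. This is a hedged illustration; function names are our own, not from the paper.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for xi in items + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Discard majority-class examples until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = min(len(items) for items in by_class.values())
    X_out, y_out = [], []
    for label, items in by_class.items():
        for xi in rng.sample(items, target):
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out
```

Oversampling keeps all available information at the cost of duplicated minority examples (and a risk of overfitting them); undersampling trains faster but throws away majority-class data.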
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Outlier Detection as Instance Selection Method for Feature Selection in Time Series Classification [0.0]
Outlier detection is used to filter the instances provided to feature selection methods, selecting rare instances.
For some data sets, the resulting increase in performance was only a few percent.
For other datasets, we were able to achieve increases in performance of up to 16 percent.
arXiv Detail & Related papers (2021-11-16T14:44:33Z)
- An Empirical Study on the Joint Impact of Feature Selection and Data Resampling on Imbalance Classification [4.506770920842088]
This study focuses on the synergy between feature selection and data resampling for imbalance classification.
We conduct a large number of experiments on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms.
arXiv Detail & Related papers (2021-09-01T06:01:51Z)
- Unsupervised neural adaptation model based on optimal transport for spoken language identification [54.96267179988487]
Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded.
We propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID.
arXiv Detail & Related papers (2020-12-24T07:37:19Z)
- Message Passing Adaptive Resonance Theory for Online Active Semi-supervised Learning [30.19936050747407]
We propose Message Passing Adaptive Resonance Theory (MPART) for online active semi-supervised learning.
MPART infers the class of unlabeled data and selects informative and representative samples through message passing between nodes on the topological graph.
We evaluate our model with comparable query selection strategies and frequencies, showing that MPART significantly outperforms the competitive models in online active learning environments.
arXiv Detail & Related papers (2020-12-02T14:14:42Z)
- Improving Multi-Turn Response Selection Models with Complementary Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.