Robust subgroup discovery
- URL: http://arxiv.org/abs/2103.13686v1
- Date: Thu, 25 Mar 2021 09:04:13 GMT
- Title: Robust subgroup discovery
- Authors: Hugo Manuel Proença, Thomas Bäck, Matthijs van Leeuwen
- Abstract summary: We formalize the problem of optimal robust subgroup discovery using the Minimum Description Length principle.
We propose RSD, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup is added in each iteration.
We empirically show on 54 datasets that RSD outperforms previous subgroup set discovery methods in terms of quality and subgroup list size.
- Score: 0.2578242050187029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the problem of robust subgroup discovery, i.e., finding a set of
interpretable descriptions of subsets that 1) stand out with respect to one or
more target attributes, 2) are statistically robust, and 3) are non-redundant. Many
attempts have been made to mine either locally robust subgroups or to tackle
the pattern explosion, but we are the first to address both challenges at the
same time from a global perspective.
First, we formulate a broad model class of subgroup lists, i.e., ordered sets
of subgroups, for univariate and multivariate targets that can consist of
nominal or numeric variables. This novel model class allows us to formalize the
problem of optimal robust subgroup discovery using the Minimum Description
Length (MDL) principle, where we resort to optimal Normalized Maximum
Likelihood and Bayesian encodings for nominal and numeric targets,
respectively. Notably, we show that our problem definition is equal to mining
the top-1 subgroup with an information-theoretic quality measure plus a penalty
for complexity.
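To make that equivalence concrete, here is a schematic of such a criterion in our own notation (not the paper's exact formulation): for a candidate subgroup $s$ covering $n_s$ instances with empirical target distribution $\hat{p}_s$, compared against the dataset marginal $\hat{p}_d$,

$$\Delta L(s) \;=\; \underbrace{n_s\,\mathrm{KL}\!\big(\hat{p}_s \,\|\, \hat{p}_d\big)}_{\text{information-theoretic quality}} \;-\; \underbrace{C(s)}_{\text{complexity penalty}},$$

where $C(s)$ is the cost of encoding the description of $s$; the paper's NML and Bayesian encodings make both terms precise for nominal and numeric targets.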
Second, as finding optimal subgroup lists is NP-hard, we propose RSD, a
greedy heuristic that finds good subgroup lists and guarantees that the most
significant subgroup found according to the MDL criterion is added in each
iteration, which is shown to be equivalent to a Bayesian one-sample
proportions, multinomial, or t-test between the subgroup and dataset marginal
target distributions plus a multiple hypothesis testing penalty. We empirically
show on 54 datasets that RSD outperforms previous subgroup set discovery
methods in terms of quality and subgroup list size.
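A minimal sketch of the greedy loop described above, in Python; candidate generation and the MDL gain oracle are abstracted away, and all names are ours, not the authors' implementation:

```python
def rsd_greedy(candidates, dataset, mdl_gain):
    """Greedy subgroup-list construction: at each iteration, add the
    candidate with the largest MDL compression gain; stop when no
    candidate still improves compression (i.e., is significant)."""
    subgroup_list = []
    while candidates:
        gains = {s: mdl_gain(subgroup_list, s, dataset) for s in candidates}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:        # nothing left that compresses the data
            break
        subgroup_list.append(best)  # most significant subgroup this round
        candidates = [s for s in candidates if s is not best]
    return subgroup_list
```

Per the abstract, each accepted addition corresponds to passing a Bayesian one-sample test against the dataset marginal, with a multiple-hypothesis-testing penalty folded into the gain.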
Related papers
- Using Constraints to Discover Sparse and Alternative Subgroup Descriptions [0.0]
Subgroup-discovery methods allow users to obtain simple descriptions of interesting regions in a dataset.
We focus on two types of constraints: First, we limit the number of features used in subgroup descriptions, making the latter sparse.
Second, we propose the novel optimization problem of finding alternative subgroup descriptions, which cover a similar set of data objects as a given subgroup but use different features.
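A hypothetical check for these two constraint types, with made-up thresholds (the paper's actual optimization formulation differs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Description:
    features: frozenset  # feature names used in the subgroup description
    cover: frozenset     # indices of the data objects the description selects

def is_sparse_alternative(cand: Description, given: Description,
                          max_features: int = 3, min_jaccard: float = 0.8) -> bool:
    """Illustrative check: a sparsity cap on the description, plus
    'alternative' status (similar cover, disjoint feature set).
    Thresholds are invented for the example."""
    sparse = len(cand.features) <= max_features
    jaccard = len(cand.cover & given.cover) / len(cand.cover | given.cover)
    return sparse and jaccard >= min_jaccard and cand.features.isdisjoint(given.features)
```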
arXiv Detail & Related papers (2024-06-03T15:10:01Z)
- Discover and Mitigate Multiple Biased Subgroups in Image Classifiers [45.96784278814168]
Machine learning models can perform well on in-distribution data but often fail on biased subgroups that are underrepresented in the training data.
We propose Decomposition, Interpretation, and Mitigation (DIM) to address this problem.
Our approach decomposes the image features into multiple components that represent multiple subgroups.
arXiv Detail & Related papers (2024-03-19T14:44:54Z)
- Clustered Orienteering Problem with Subgroups [6.961946145048321]
This paper introduces the Clustered Orienteering Problem with Subgroups (COPS).
We show that the new formulation can model and solve two previous well-known variants.
arXiv Detail & Related papers (2023-12-26T18:28:25Z)
- Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
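In symbols, our reading of the summary's objective and rate (with $d$ a complexity measure of the hypothesis class, e.g. its VC dimension):

$$\min_{h \in \mathcal{H}} \; \max_{i \in [k]} \; \mathbb{E}_{z \sim \mathcal{D}_i}\big[\ell(h, z)\big], \qquad n \,=\, O\!\left(\frac{d + k}{\varepsilon^2}\right) \text{ (up to logarithmic factors)}.$$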
arXiv Detail & Related papers (2023-12-08T16:06:29Z)
- Multi-Group Fairness Evaluation via Conditional Value-at-Risk Testing [24.553384023323332]
We propose an approach to test for performance disparities based on Conditional Value-at-Risk.
We show that the sample complexity required for discovering performance violations is reduced exponentially, to at most the square root of the number of groups.
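For reference, the standard definition of Conditional Value-at-Risk for a loss variable $Z$ at level $\alpha$ (the paper's exact test statistic may differ):

$$\mathrm{VaR}_\alpha(Z) = \inf\{z : \Pr(Z \le z) \ge \alpha\}, \qquad \mathrm{CVaR}_\alpha(Z) = \mathbb{E}\big[Z \mid Z \ge \mathrm{VaR}_\alpha(Z)\big].$$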
arXiv Detail & Related papers (2023-12-06T19:25:32Z)
- Concomitant Group Testing [49.50984893039441]
We introduce a variation of the group testing problem capturing the idea that a positive test requires a combination of multiple "types" of item.
The goal is to reliably identify all of the semi-defective sets using as few tests as possible.
Our algorithms are distinguished by (i) whether they are deterministic (zero-error) or randomized (small-error), and (ii) whether they are non-adaptive, fully adaptive, or have limited adaptivity.
arXiv Detail & Related papers (2023-09-08T09:11:12Z)
- Focus on the Common Good: Group Distributional Robustness Follows [47.62596240492509]
This paper proposes a new and simple algorithm that explicitly encourages learning of features that are shared across various groups.
While Group-DRO focuses on the groups with the worst regularized loss, focusing instead on groups that enable better performance even on other groups could lead to learning of shared/common features.
arXiv Detail & Related papers (2021-10-06T09:47:41Z)
- Just Train Twice: Improving Group Robustness without Training Group Information [101.84574184298006]
Standard training via empirical risk minimization can produce models that achieve high accuracy on average but low accuracy on certain groups.
Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO), require expensive group annotations for each training point.
We propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified.
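A toy rendering of the two stages with scikit-learn (the paper trains neural networks and treats the upweighting factor as a hyperparameter; everything below is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def jtt(X, y, lambda_up=20.0):
    """Just-Train-Twice sketch: (1) fit a standard ERM model, then
    (2) refit with the first model's training errors upweighted."""
    stage1 = LogisticRegression(max_iter=1000).fit(X, y)
    errors = stage1.predict(X) != y                # identification stage
    weights = np.where(errors, lambda_up, 1.0)     # upweight the error set
    stage2 = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
    return stage2
```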
arXiv Detail & Related papers (2021-07-19T17:52:32Z)
- GroupifyVAE: from Group-based Definition to VAE-based Unsupervised Representation Disentanglement [91.9003001845855]
VAE-based unsupervised disentanglement cannot be achieved without introducing additional inductive biases.
We address VAE-based unsupervised disentanglement by leveraging the constraints derived from the Group Theory based definition as the non-probabilistic inductive bias.
We train 1800 models covering the most prominent VAE-based models on five datasets to verify the effectiveness of our method.
arXiv Detail & Related papers (2021-02-20T09:49:51Z)
- Discovering outstanding subgroup lists for numeric targets using MDL [0.34410212782758054]
We propose an algorithm for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists.
We show that our formalization coincides with an existing quality measure when finding a single subgroup.
We next propose SSD++, an algorithm for which we empirically demonstrate that it returns outstanding subgroup lists.
arXiv Detail & Related papers (2020-06-16T14:29:52Z)