Discovering outstanding subgroup lists for numeric targets using MDL
- URL: http://arxiv.org/abs/2006.09186v1
- Date: Tue, 16 Jun 2020 14:29:52 GMT
- Title: Discovering outstanding subgroup lists for numeric targets using MDL
- Authors: Hugo M. Proença, Peter Grünwald, Thomas Bäck, Matthijs van Leeuwen
- Abstract summary: We propose an algorithm for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists.
We show that our formalization coincides with an existing quality measure when finding a single subgroup.
We next propose SSD++, an algorithm for which we empirically demonstrate that it returns outstanding subgroup lists.
- Score: 0.34410212782758054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of subgroup discovery (SD) is to find interpretable descriptions of
subsets of a dataset that stand out with respect to a target attribute. To
address the problem of mining large numbers of redundant subgroups, subgroup
set discovery (SSD) has been proposed. State-of-the-art SSD methods have their
limitations though, as they typically heavily rely on heuristics and/or
user-chosen hyperparameters.
We propose a dispersion-aware problem formulation for subgroup set discovery
that is based on the minimum description length (MDL) principle and subgroup
lists. We argue that the best subgroup list is the one that best summarizes the
data given the overall distribution of the target. We restrict our focus to a
single numeric target variable and show that our formalization coincides with
an existing quality measure when finding a single subgroup, but that it
additionally allows trading off subgroup quality against the complexity of the
subgroup. We next propose SSD++, a heuristic algorithm that, as we empirically
demonstrate, returns outstanding subgroup lists: non-redundant sets of
compact subgroups that stand out by having strongly deviating means and small
spread.
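The quality-versus-complexity trade-off described in the abstract can be made concrete with a toy gain function: a subgroup's target values are encoded under the subgroup's own normal distribution rather than the dataset-wide one, and the bits saved are reduced by a penalty per condition in the subgroup description. This is a minimal sketch, not the paper's exact MDL encoding; the `cond_cost` penalty and the plain maximum-likelihood normal code are simplifying assumptions.

```python
import math

def normal_nll(values, mu, sigma):
    """Total negative log-likelihood (in nats) of values under N(mu, sigma^2)."""
    sigma = max(sigma, 1e-9)  # guard against zero spread
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (v - mu) ** 2 / (2 * sigma ** 2)
        for v in values
    )

def mean_std(values):
    """Maximum-likelihood mean and standard deviation."""
    mu = sum(values) / len(values)
    sd = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return mu, sd

def mdl_gain(subgroup_values, all_values, n_conditions, cond_cost=2.0):
    """Nats saved by encoding the subgroup with its own normal model
    instead of the overall one, minus a toy per-condition penalty
    (cond_cost is an assumed stand-in for the real MDL model cost)."""
    mu_all, sd_all = mean_std(all_values)
    mu_sub, sd_sub = mean_std(subgroup_values)
    saved = (normal_nll(subgroup_values, mu_all, sd_all)
             - normal_nll(subgroup_values, mu_sub, sd_sub))
    return saved - cond_cost * n_conditions
```

A subgroup with a strongly deviating mean and small spread yields a large positive gain, while each extra condition in its description must pay for itself by at least `cond_cost` nats; this is how such a formalization can prefer a slightly worse but simpler subgroup.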
Related papers
- Using Constraints to Discover Sparse and Alternative Subgroup Descriptions [0.0]
Subgroup-discovery methods allow users to obtain simple descriptions of interesting regions in a dataset.
We focus on two types of constraints: First, we limit the number of features used in subgroup descriptions, making the descriptions sparse.
Second, we propose the novel optimization problem of finding alternative subgroup descriptions, which cover a similar set of data objects as a given subgroup but use different features.
arXiv Detail & Related papers (2024-06-03T15:10:01Z) - Discover and Mitigate Multiple Biased Subgroups in Image Classifiers [45.96784278814168]
Machine learning models can perform well on in-distribution data but often fail on biased subgroups that are underrepresented in the training data.
We propose Decomposition, Interpretation, and Mitigation (DIM) to address this problem.
Our approach decomposes the image features into multiple components that represent multiple subgroups.
arXiv Detail & Related papers (2024-03-19T14:44:54Z) - Subgroup Discovery in MOOCs: A Big Data Application for Describing Different Types of Learners [0.0]
This paper aims to categorize and describe different types of learners in massive open online courses (MOOCs) by means of a subgroup discovery approach based on MapReduce.
The proposed subgroup discovery approach relies on emerging parallel methodologies such as MapReduce to cope with extremely large datasets.
arXiv Detail & Related papers (2024-02-10T16:07:38Z) - Identification of Systematic Errors of Image Classifiers on Rare Subgroups [12.064692111429494]
Systematic errors can impact both fairness for demographic minority groups and robustness and safety under domain shift.
We leverage recent advances in text-to-image models and search in the space of textual descriptions of subgroups ("prompts") for subgroups where the target model has low performance.
We study subgroup coverage and identifiability with PromptAttack in a controlled setting and find that it identifies systematic errors with high accuracy.
arXiv Detail & Related papers (2023-03-09T07:08:25Z) - Outlier-Robust Group Inference via Gradient Space Clustering [50.87474101594732]
Existing methods can improve the worst-group performance, but they require group annotations, which are often expensive and sometimes infeasible to obtain.
We address the problem of learning group annotations in the presence of outliers by clustering the data in the space of gradients of the model parameters.
We show that data in the gradient space has a simpler structure while preserving information about minority groups and outliers, making it suitable for standard clustering methods like DBSCAN.
arXiv Detail & Related papers (2022-10-13T06:04:43Z) - Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels can wrongly direct neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z) - Focus on the Common Good: Group Distributional Robustness Follows [47.62596240492509]
This paper proposes a new and simple algorithm that explicitly encourages learning of features that are shared across various groups.
While Group-DRO focuses on the groups with the worst regularized loss, focusing instead on groups that enable better performance even on other groups could lead to the learning of shared/common features.
arXiv Detail & Related papers (2021-10-06T09:47:41Z) - Just Train Twice: Improving Group Robustness without Training Group Information [101.84574184298006]
Standard training via empirical risk minimization can produce models that achieve high accuracy on average but low accuracy on certain groups.
Prior approaches that achieve high worst-group accuracy, like group distributionally robust optimization (group DRO) require expensive group annotations for each training point.
We propose a simple two-stage approach, JTT, that first trains a standard ERM model for several epochs, and then trains a second model that upweights the training examples that the first model misclassified.
arXiv Detail & Related papers (2021-07-19T17:52:32Z) - Learning Multi-Attention Context Graph for Group-Based Re-Identification [214.84551361855443]
Learning to re-identify or retrieve a group of people across non-overlapping camera systems has important applications in video surveillance.
In this work, we consider employing context information for identifying groups of people, i.e., group re-id.
We propose a novel unified framework based on graph neural networks to simultaneously address the group-based re-id tasks.
arXiv Detail & Related papers (2021-04-29T09:57:47Z) - Robust subgroup discovery [0.2578242050187029]
We formalize the problem of optimal robust subgroup discovery using the Minimum Description Length principle.
We propose RSD, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup is added in each iteration.
We empirically show on 54 datasets that RSD outperforms previous subgroup set discovery methods in terms of quality and subgroup list size.
arXiv Detail & Related papers (2021-03-25T09:04:13Z)
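The greedy construction shared by SSD++ above and its follow-up RSD can be sketched in a few lines: repeatedly add the candidate description with the largest compression gain on the not-yet-covered rows, and stop when no candidate has positive gain. This is an illustrative toy under assumed simplifications: a plain normal encoding, a fixed `penalty` in place of the real MDL model cost, and candidate predicates supplied by the caller rather than searched.

```python
import math

def nll(vals, mu, sd):
    """Negative log-likelihood (in nats) of vals under N(mu, sd^2)."""
    sd = max(sd, 1e-9)  # guard against zero spread
    return sum(0.5 * math.log(2 * math.pi * sd * sd)
               + (v - mu) ** 2 / (2 * sd * sd) for v in vals)

def stats(vals):
    mu = sum(vals) / len(vals)
    sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5
    return mu, sd

def greedy_subgroup_list(rows, target, candidates, penalty=3.0):
    """Greedily append the candidate description whose subgroup best
    compresses the target on the uncovered rows, until no candidate
    improves the encoding.
    rows: list of dicts; target: key of the numeric target;
    candidates: list of (name, predicate) pairs (assumed given)."""
    mu0, sd0 = stats([r[target] for r in rows])  # overall target distribution
    uncovered = list(rows)
    subgroup_list = []
    while True:
        best = None
        for name, pred in candidates:
            vals = [r[target] for r in uncovered if pred(r)]
            if len(vals) < 2:
                continue
            mu, sd = stats(vals)
            # nats saved by the subgroup's own model, minus a toy model cost
            gain = nll(vals, mu0, sd0) - nll(vals, mu, sd) - penalty
            if best is None or gain > best[0]:
                best = (gain, name, pred)
        if best is None or best[0] <= 0:
            return subgroup_list
        subgroup_list.append(best[1])
        # rows matched by an earlier subgroup are handled by that subgroup
        uncovered = [r for r in uncovered if not best[2](r)]
```

Because covered rows are removed before the next iteration, each appended description is conditioned on the list built so far, which is what makes the result a subgroup *list* rather than an unordered set.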
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.