Related papers: MISFEAT: Feature Selection for Subgroups with Systematic Missing Data

URL: http://arxiv.org/abs/2412.06711v1
Date: Mon, 09 Dec 2024 17:59:59 GMT
Title: MISFEAT: Feature Selection for Subgroups with Systematic Missing Data
Authors: Bar Genossar, Thinh On, Md. Mouinul Islam, Ben Eliav, Senjuti Basu Roy, Avigdor Gal
Abstract summary: We address the challenge of systematic missing data, a scenario in which some feature values are missing for all tuples of a subgroup. Our goal is to identify the top-K feature subsets of a fixed size with the highest joint mutual information with a target variable. We propose a generalizable model based on a heterogeneous graph neural network to identify interdependencies between feature-subgroup-target variable connections.
Score: 8.063972429611365
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate the problem of selecting features for datasets that can be naturally partitioned into subgroups (e.g., according to socio-demographic groups and age), each with its own dominant set of features. Within this subgroup-oriented framework, we address the challenge of systematic missing data, a scenario in which some feature values are missing for all tuples of a subgroup, due to flawed data integration, regulatory constraints, or privacy concerns. Feature selection is governed by finding mutual information, a popular quantification of correlation, between features and a target variable. Our goal is to identify the top-K feature subsets of a fixed size with the highest joint mutual information with a target variable. In the presence of systematic missing data, the closed form of mutual information cannot simply be applied. We argue that in such a setting, leveraging relationships between available feature mutual information within a subgroup or across subgroups can assist in inferring missing mutual information values. We propose a generalizable model based on a heterogeneous graph neural network to identify interdependencies between feature-subgroup-target variable connections by modeling them as a multiplex graph and employing information propagation between its nodes. We address two distinct scalability challenges related to training and propose principled solutions to tackle them. Through an extensive empirical evaluation, we demonstrate the efficacy of the proposed solutions both qualitatively and in terms of running time.
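
The selection objective in the abstract (top-K feature subsets of a fixed size, ranked by joint mutual information with the target) can be illustrated on fully observed discrete data. The sketch below is not the paper's method: it is a brute-force plug-in estimate, and the function names, the toy data, and the tie-breaking rule are assumptions made for illustration. It covers only the case the closed form handles; when a feature is missing for every tuple of a subgroup, this estimate cannot be computed for that subgroup, which is the gap the paper's graph-based model is meant to fill.

```python
from collections import Counter
from itertools import combinations, product
from math import log2


def joint_mutual_information(rows, feature_idx, target):
    """Plug-in estimate of I(X_S; Y) in bits for discrete, fully observed data.

    rows        -- list of tuples, one tuple of feature values per record
    feature_idx -- tuple of column indices forming the subset S
    target      -- list of target values, aligned with rows
    """
    n = len(rows)
    xs = [tuple(row[i] for i in feature_idx) for row in rows]
    count_x = Counter(xs)
    count_y = Counter(target)
    count_xy = Counter(zip(xs, target))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def top_k_subsets(rows, target, size, k):
    """Score every subset of the given size and return the k best as (score, subset)."""
    n_features = len(rows[0])
    scored = [
        (joint_mutual_information(rows, subset, target), subset)
        for subset in combinations(range(n_features), size)
    ]
    # Highest score first; ties broken by subset indices (an arbitrary choice).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Toy data (hypothetical): y = f0 XOR f1, while f2 is irrelevant.
rows = list(product([0, 1], repeat=3))
target = [f0 ^ f1 for f0, f1, _ in rows]

best_score, best_subset = top_k_subsets(rows, target, size=2, k=1)[0]
print(best_subset, best_score)
```

On this toy data no single feature carries information about the target, yet the pair (f0, f1) determines it completely, which is why the objective is stated over subsets with joint mutual information instead of per-feature scores. The exhaustive enumeration grows combinatorially with the number of features, so it serves only as a reference for small cases.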

Related papers

GOLFS: Feature Selection via Combining Both Global and Local Information for High Dimensional Clustering [10.740524877905685]
We propose a new unsupervised feature selection method, named GlObal and Local information combined Feature Selection (GOLFS). GOLFS combines both local geometric structure via manifold learning and global correlation structure of samples to select the discriminative features. The combination improves the accuracy of both feature selection and clustering by exploiting more comprehensive information.
arXiv Detail & Related papers (2025-07-15T03:39:07Z)
A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random [3.7967162203679155]
This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into clustering to enable more flexible variable selection. We demonstrate that, under certain regularity conditions, the proposed framework achieves both clustering consistency and consistency, even in the presence of missing data.
arXiv Detail & Related papers (2025-05-25T11:08:43Z)
Flexible inference in heterogeneous and attributed multilayer networks [21.349513661012498]
We develop a probabilistic generative model to perform inference in multilayer networks with arbitrary types of information. We demonstrate its ability to unveil a variety of patterns in a social support network among villagers in rural India.
arXiv Detail & Related papers (2024-05-31T15:21:59Z)
A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection [0.1474723404975345]
Many-Objective Feature Selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features in a supervised learning task. This paper proposes an original methodology to support data scientists in the interpretation and comparison of the MOFS outcome by combining post-processing and visualisation of the set of solutions.
arXiv Detail & Related papers (2023-11-30T17:44:22Z)
Causal Feature Selection via Transfer Entropy [59.999594949050596]
Causal discovery aims to identify causal relationships between features with observational data. We introduce a new causal feature selection approach that relies on the forward and backward feature selection procedures. We provide theoretical guarantees on the regression and classification errors for both the exact and the finite-sample cases.
arXiv Detail & Related papers (2023-10-17T08:04:45Z)
Composite Feature Selection using Deep Ensembles [130.72015919510605]
We investigate the problem of discovering groups of predictive features without predefined grouping. We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups. We propose a new metric to measure similarity between discovered groups and the ground truth.
arXiv Detail & Related papers (2022-11-01T17:49:40Z)
Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
Addressing Missing Sources with Adversarial Support-Matching [8.53946780558779]
We investigate a scenario in which the absence of certain data is linked to the second level of a two-level hierarchy in the data. Inspired by the idea of protected groups from algorithmic fairness, we refer to the partitions carved by this second level as "subgroups". We make use of an additional, diverse but unlabeled dataset, called the "deployment set", to learn a representation that is invariant to subgroup.
arXiv Detail & Related papers (2022-03-24T16:19:19Z)
Causal Scene BERT: Improving object detection by searching for challenging groups of data [125.40669814080047]
Computer vision applications rely on learning-based perception modules parameterized with neural networks for tasks like object detection. These modules frequently have low expected error overall but high error on atypical groups of data due to biases inherent in the training process. Our main contribution is a pseudo-automatic method to discover such groups in foresight by performing causal interventions on simulated scenes.
arXiv Detail & Related papers (2022-02-08T05:14:16Z)
Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions. We propose an algorithm that optimizes for the worst-off group assignments from a constraint set. We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z)
A Framework for Multi-View Classification of Features [6.660458629649826]
In data classification problems, typical approaches fail when the feature set is too large. In this research, an innovative framework for multi-view ensemble classification, inspired by the problem of object recognition in the multiple views theory of humans, is proposed.
arXiv Detail & Related papers (2021-08-02T16:27:43Z)
Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics. We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data. Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
