Related papers: The Most Important Features in Generalized Additive Models Might Be Groups of Features

URL: http://arxiv.org/abs/2506.19937v1
Date: Tue, 24 Jun 2025 18:25:24 GMT
Title: The Most Important Features in Generalized Additive Models Might Be Groups of Features
Authors: Tomas M. Bosschieter, Luis Franca, Jessica Wolk, Yiyuan Wu, Bella Mehta, Joseph Dehoney, Orsolya Kiss, Fiona C. Baker, Qingyu Zhao, Rich Caruana, Kilian M. Pohl
Abstract summary: This paper introduces a novel approach to determine the importance of a group of features for Generalized Additive Models (GAMs). We showcase properties of our method on three synthetic experiments that illustrate the behavior of group importance across various data regimes.
Score: 10.324544560083543
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While analyzing the importance of features has become ubiquitous in interpretable machine learning, the joint signal from a group of related features is sometimes overlooked or inadvertently excluded. Neglecting the joint signal could bypass a critical insight: in many instances, the most significant predictors are not isolated features, but rather the combined effect of groups of features. This can be especially problematic for datasets that contain natural groupings of features, including multimodal datasets. This paper introduces a novel approach to determine the importance of a group of features for Generalized Additive Models (GAMs) that is efficient, requires no model retraining, allows defining groups posthoc, permits overlapping groups, and remains meaningful in high-dimensional settings. Moreover, this definition offers a parallel with explained variation in statistics. We showcase properties of our method on three synthetic experiments that illustrate the behavior of group importance across various data regimes. We then demonstrate the importance of groups of features in identifying depressive symptoms from a multimodal neuroscience dataset, and study the importance of social determinants of health after total hip arthroplasty. These two case studies reveal that analyzing group importance offers a more accurate, holistic view of the medical issues compared to a single-feature analysis.
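
The abstract does not give the formula for group importance, so the following is only a minimal sketch under an assumed definition: the importance of a group is taken to be the mean absolute value of the summed additive terms of its features, computed from the per-feature contributions of an already-fitted GAM. The function name `group_importance`, the `contributions` matrix, and the toy numbers are hypothetical and not taken from the paper. The sketch illustrates the properties the abstract lists: no retraining, groups defined after fitting, and overlapping groups.

```python
import numpy as np


def group_importance(contributions, group):
    """Assumed definition: mean absolute value of the summed contributions.

    contributions: array of shape (n_samples, n_features); entry (i, j) is
        the additive term f_j(x_ij) of an already-fitted GAM.
    group: iterable of column indices; groups may overlap and can be
        chosen after the model has been fitted.
    """
    contributions = np.asarray(contributions, dtype=float)
    idx = list(group)
    # Sum the additive terms of the group for each sample, then average
    # the absolute joint contribution over samples.
    joint = contributions[:, idx].sum(axis=1)
    return float(np.mean(np.abs(joint)))


# Hypothetical per-feature contributions for 3 samples and 3 features.
contribs = np.array([
    [1.0, -1.0, 0.5],
    [-2.0, 2.0, 0.5],
    [3.0, -3.0, -1.0],
])

# Single-feature importances are groups of size one.
single = [group_importance(contribs, [j]) for j in range(3)]

# Features 0 and 1 cancel each other exactly, so their joint importance
# is zero even though each is individually important.
cancelling = group_importance(contribs, [0, 1])

# Feature 1 also appears in this second, overlapping group.
overlapping = group_importance(contribs, [1, 2])
```

Under this assumed definition, the importance of a group is not the sum of its members' importances: contributions that offset each other shrink the joint signal, and contributions that reinforce each other enlarge it. Each group costs one pass over the precomputed contribution matrix.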

Related papers

GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
Explaining Clustering of Ecological Momentary Assessment Data Through Temporal and Feature Attention [4.951599300340955]
Ecological Momentary Assessment (EMA) studies offer rich individual data on psychopathology-relevant variables in real-time. This paper proposes an attention-based interpretable framework to identify the important time-points and variables that play primary roles in distinguishing between clusters.
arXiv Detail & Related papers (2024-05-08T07:09:43Z)
Feature graphs for interpretable unsupervised tree ensembles: centrality, interaction, and application in disease subtyping [0.24578723416255746]
Feature selection assumes a pivotal role in enhancing model interpretability. The accuracy gained from aggregating decision trees comes at the expense of interpretability. The study introduces novel methods to construct feature graphs from unsupervised random forests.
arXiv Detail & Related papers (2024-04-27T12:47:37Z)
Feature Importance Disparities for Data Bias Investigations [2.184775414778289]
It is widely held that one cause of downstream bias in classifiers is bias present in the training data. We present one such method that given a dataset $X$ consisting of protected and unprotected features, outcomes $y$, and a regressor $h$ that predicts $y$ given $X$. We show across $4$ datasets and $4$ common feature importance methods of broad interest to the machine learning community that we can efficiently find subgroups with large FID values even over exponentially large subgroup classes.
arXiv Detail & Related papers (2023-03-03T04:12:04Z)
Composite Feature Selection using Deep Ensembles [130.72015919510605]
We investigate the problem of discovering groups of predictive features without predefined grouping. We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups. We propose a new metric to measure similarity between discovered groups and the ground truth.
arXiv Detail & Related papers (2022-11-01T17:49:40Z)
Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data [81.43750358586072]
We propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets.
arXiv Detail & Related papers (2022-10-24T08:57:55Z)
Ensemble feature selection with clustering for analysis of high-dimensional, correlated clinical data in the search for Alzheimer's disease biomarkers [0.0]
We present a novel framework to create feature selection ensembles from multivariate feature selectors. We take into account the biases produced by groups of correlated features, using agglomerative hierarchical clustering in a pre-processing step. These methods were applied to two real-world datasets from studies of Alzheimer's disease (AD), a progressive neurodegenerative disease that has no cure and is not yet fully understood.
arXiv Detail & Related papers (2022-07-06T01:03:50Z)
Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution. We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
Grouped Feature Importance and Combined Features Effect Plot [2.15867006052733]
Interpretable machine learning has become a very active area of research due to the rising popularity of machine learning algorithms. We provide a comprehensive overview of how existing model-agnostic techniques can be defined for feature groups to assess the grouped feature importance. We introduce the combined features effect plot, which is a technique to visualize the effect of a group of features based on a sparse, interpretable linear combination of features.
arXiv Detail & Related papers (2021-04-23T16:27:38Z)
Statistical Analytics and Regional Representation Learning for COVID-19 Pandemic Understanding [4.731074162093199]
The rapid spread of the novel coronavirus (COVID-19) has severely impacted almost all countries around the world. This paper combines and processes an extensive collection of publicly available datasets to provide a unified information source. A specific RNN-based inference pipeline called DoubleWindowLSTM-CP is proposed in this work for predictive event modeling.
arXiv Detail & Related papers (2020-08-08T03:35:16Z)
Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization. We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise. We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.