DIWIFT: Discovering Instance-wise Influential Features for Tabular Data
- URL: http://arxiv.org/abs/2207.02773v1
- Date: Wed, 6 Jul 2022 16:07:46 GMT
- Title: DIWIFT: Discovering Instance-wise Influential Features for Tabular Data
- Authors: Pengxiang Cheng, Hong Zhu, Xing Tang, Dugang Liu, Yanyu Chen, Xiaoting
Wang, Weike Pan, Zhong Ming, Xiuqiang He
- Abstract summary: Tabular data is one of the most common data storage formats in business applications, including retail, banking, and e-commerce.
One of the critical problems in learning tabular data is to distinguish influential features from all the predetermined features.
We propose a novel method for discovering instance-wise influential features for tabular data (DIWIFT)
Our method minimizes the loss on a validation set and is thus more robust to the distribution shift between the training and test datasets.
- Score: 29.69737486124891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is one of the most common data storage formats in business
applications, including retail, banking, and e-commerce. These applications rely
heavily on machine learning models to achieve business success. One of the
critical problems in learning tabular data is to distinguish influential
features from all the predetermined features. Global feature selection has been
well-studied for quite some time, assuming that all instances have the same
influential feature subsets. However, different instances rely on different
feature subsets in practice, which has led instance-wise feature selection to
receive increasing attention in recent studies. In this
paper, we first propose a novel method for discovering instance-wise
influential features for tabular data (DIWIFT), the core of which is to
introduce the influence function to measure the importance of an instance-wise
feature. DIWIFT is capable of automatically discovering influential feature
subsets of different sizes in different instances, which is different from
global feature selection that considers all instances with the same influential
feature subset. Moreover, unlike previous instance-wise feature selection
methods, DIWIFT minimizes the loss on a validation set and is thus more robust
to the distribution shift between the training and test datasets, which is
important in tabular data. Finally, we
conduct extensive experiments on both synthetic and real-world datasets to
validate the effectiveness of DIWIFT, comparing it with baseline methods.
Moreover, we demonstrate the robustness of our method via ablation
experiments.
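As a hedged illustration of the core idea above (not the authors' exact algorithm), the sketch below computes a first-order, instance-wise feature importance for a fixed logistic model: the gradient of the log-loss with respect to a per-instance feature mask, evaluated at mask = 1. The model choice and all names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def instance_wise_importance(w, X, y):
    """First-order influence sketch: gradient of the log-loss with respect
    to a per-instance feature mask, evaluated at mask = 1. A larger
    |gradient| means the feature is more influential for that instance."""
    p = sigmoid(X @ w)                       # predictions of a fixed model
    # chain rule through x * mask: dL/dm_ij = (p_i - y_i) * w_j * x_ij
    grads = (p - y)[:, None] * w[None, :] * X
    return np.abs(grads)

# toy example: a feature with zero weight can never be influential
w = np.array([2.0, 0.0, -1.0])
X = np.array([[1.0, 5.0, 0.5],
              [0.2, 1.0, 3.0]])
y = np.array([1.0, 0.0])
imp = instance_wise_importance(w, X, y)      # shape (n_instances, n_features)
```

Note how the importance of the same feature differs across instances, which is the key departure from global feature selection.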
Related papers
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data with the necessary reasoning skills for the intended downstream application.
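A minimal sketch in the spirit of this idea, assuming gradients are already available as plain vectors (the random projection, dimensions, and function names are simplifying assumptions, not LESS's actual implementation):

```python
import numpy as np

def select_by_gradient_similarity(train_grads, target_grad, k, dim=8, seed=0):
    """Rank training examples by cosine similarity between low-rank
    projections of their gradients and the target-task gradient."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((train_grads.shape[1], dim)) / np.sqrt(dim)
    G = train_grads @ proj                    # (n, dim) projected gradients
    t = target_grad @ proj                    # (dim,) projected target
    sims = (G @ t) / (np.linalg.norm(G, axis=1) * np.linalg.norm(t) + 1e-12)
    return np.argsort(-sims)[:k]              # indices of the top-k examples

# toy check: an example whose gradient equals the target's ranks first
data_rng = np.random.default_rng(1)
target = np.ones(16)
train = np.vstack([target, data_rng.standard_normal((5, 16))])
top = select_by_gradient_similarity(train, target, k=2)
```

The low-rank projection is what keeps this tractable when gradients live in the full parameter space of a large model.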
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM)
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- MvFS: Multi-view Feature Selection for Recommender System [7.0190343591422115]
We propose Multi-view Feature Selection (MvFS), which selects informative features for each instance more effectively.
MvFS employs a multi-view network consisting of multiple sub-networks, each of which learns to measure the feature importance of a part of data.
MvFS adopts an effective importance score modeling strategy which is applied independently to each field.
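A toy sketch of the multi-view idea under stated assumptions (the linear "views" and all names below are hypothetical simplifications, not MvFS's actual network): each view scores every field, the scores are averaged across views, and a sigmoid gate is applied independently per field, so fields do not compete as they would under a softmax.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multi_view_importance(x, view_weights):
    """Average per-field scores from several views (small linear scorers
    here), then gate each field independently with a sigmoid."""
    scores = np.stack([x @ W for W in view_weights])  # (n_views, n_fields)
    return sigmoid(scores.mean(axis=0))               # one gate per field

# toy example with 4 fields and 2 views
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
views = [rng.standard_normal((4, 4)) for _ in range(2)]
gate = multi_view_importance(x, views)
```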
arXiv Detail & Related papers (2023-09-05T09:06:34Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
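Random oversampling, the simplest of these strategies, can be sketched as follows (a generic, library-agnostic illustration, not tied to this paper):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class matches
    the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = [np.arange(len(y))]              # keep every original row
    for c, n in zip(classes, counts):
        if n < target:                        # minority class: resample it
            idx = np.flatnonzero(y == c)
            picked.append(rng.choice(idx, size=target - n, replace=True))
    sel = np.concatenate(picked)
    return X[sel], y[sel]

# a 4-vs-1 imbalance becomes 4-vs-4 after oversampling
X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 0, 1])
Xb, yb = random_oversample(X, y)
```

Undersampling works symmetrically, discarding majority-class rows instead of duplicating minority ones.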
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Beyond Importance Scores: Interpreting Tabular ML by Visualizing Feature Semantics [17.410093908967976]
Interpretability is becoming an active research topic as machine learning (ML) models are more widely used to make critical decisions.
Much of the existing interpretability methods used for tabular data only report feature-importance scores.
We address this limitation by introducing Feature Vectors, a new global interpretability method.
arXiv Detail & Related papers (2021-11-10T19:42:33Z)
- Active Learning by Acquiring Contrastive Examples [8.266097781813656]
We propose an acquisition function that selects contrastive examples, i.e. data points that are similar in the model feature space.
We compare our approach with a diverse set of acquisition functions in four natural language understanding tasks and seven datasets.
arXiv Detail & Related papers (2021-09-08T16:40:18Z)
- A User-Guided Bayesian Framework for Ensemble Feature Selection in Life Science Applications (UBayFS) [0.0]
We propose UBayFS, an ensemble feature selection technique, embedded in a Bayesian statistical framework.
Our approach enhances the feature selection process by considering two sources of information: data and domain knowledge.
A comparison with standard feature selectors underlines that UBayFS achieves competitive performance, while providing additional flexibility to incorporate domain knowledge.
arXiv Detail & Related papers (2021-04-30T06:51:33Z)
- Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria to quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z)
- Meta Learning for Causal Direction [29.00522306460408]
We introduce a novel generative model that allows distinguishing cause and effect in the small data setting.
We demonstrate our method on various synthetic as well as real-world data and show that it is able to maintain high accuracy in detecting directions across varying dataset sizes.
arXiv Detail & Related papers (2020-07-06T15:12:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.