TabFSBench: Tabular Benchmark for Feature Shifts in Open Environment
- URL: http://arxiv.org/abs/2501.18935v2
- Date: Thu, 20 Feb 2025 15:54:43 GMT
- Title: TabFSBench: Tabular Benchmark for Feature Shifts in Open Environment
- Authors: Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
- Abstract summary: Tabular data is widely utilized in various machine learning tasks.
Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts have garnered limited attention.
This paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench).
TabFSBench is publicly available and can be used with a few lines of Python code at https://github.com/LAMDASZ-ML/TabFSBench.
- Abstract: Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while real-world applications often involve open environments, where distribution and feature shifts occur and lead to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates the impact of four distinct feature-shift scenarios on four categories of tabular models across various datasets, and assesses the performance of large language models (LLMs) and tabular LLMs on a tabular benchmark for the first time. Our study yields three main observations: (1) most tabular models have limited applicability in feature-shift scenarios; (2) the importance of the shifted feature set has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research directions are also explored for each observation. TabFSBench is publicly available and can be used with a few lines of Python code at https://github.com/LAMDASZ-ML/TabFSBench.
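To make observation (2) concrete, here is a minimal sketch, not TabFSBench's actual API (see the repository for that), that simulates a feature shift by permuting increasingly important test-set features and recording the accuracy drop, using scikit-learn's breast-cancer dataset and a random forest:

```python
# Sketch: simulate a feature shift by permuting the top-k most important
# features at test time, breaking their relationship with the label.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
baseline = model.score(X_te, y_te)

# Rank features by learned importance, then shift the top-k of them.
order = np.argsort(model.feature_importances_)[::-1]
rng = np.random.default_rng(0)
for k in (1, 3, 5, 10):
    X_shift = X_te.copy()
    for j in order[:k]:
        X_shift[:, j] = rng.permutation(X_shift[:, j])
    print(f"top-{k} shifted: accuracy {model.score(X_shift, y_te):.3f} "
          f"(baseline {baseline:.3f})")
```

In a setup like this, accuracy tends to fall as more important features are shifted, which is the degradation pattern the paper quantifies.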
Related papers
- Fully Test-time Adaptation for Tabular Data [48.67303250592189]
We propose Fully Test-time Adaptation (FTTA) for tabular data, which robustly optimizes the label distribution of predictions at test time.
We conduct comprehensive experiments on six benchmark datasets, evaluated using three metrics.
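For intuition, here is a generic sketch of fully test-time adaptation via entropy minimization on an unlabeled test batch (in the spirit of TENT); the paper's actual objective, which optimizes the predicted label distribution, differs in its details:

```python
# Sketch: adapt a classifier on one unlabeled test batch by minimizing
# the entropy of its predictions (a common generic TTA objective).
import torch
import torch.nn.functional as F

def adapt_on_batch(model, x, lr=1e-3, steps=1):
    """Update model parameters using only the unlabeled batch x."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(x), dim=1)
        # Mean per-sample prediction entropy; lower = more confident.
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```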
arXiv Detail & Related papers (2024-12-14T15:49:53Z)
- TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks [30.922069185335246]
We find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature.
A considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines.
This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets.
arXiv Detail & Related papers (2024-06-27T17:55:31Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Benchmarking Distribution Shift in Tabular Data with TableShift [32.071534049494076]
TableShift is a distribution shift benchmark for tabular data.
It covers domains including finance, education, public policy, healthcare, and civic participation.
We conduct a large-scale study comparing several state-of-the-art tabular data models alongside robust learning and domain generalization methods.
arXiv Detail & Related papers (2023-12-10T18:19:07Z)
- Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective [71.45945607871715]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
The core idea is to embed data instances into a shared feature space, where each instance is represented by its distance to a fixed number of nearest neighbors and their labels.
Extensive experiments on 101 datasets confirm TabPTM's effectiveness in both classification and regression tasks, with and without fine-tuning.
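The meta-representation itself is easy to sketch: each instance becomes a fixed-size vector built from the distances to its K nearest training neighbors and those neighbors' labels, independent of the original feature count. A minimal illustration (the pretraining protocol is not shown):

```python
# Sketch: encode each instance by (distances to its k nearest training
# neighbors, those neighbors' labels), giving a fixed-size 2k vector.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def meta_representation(X_train, y_train, X, k=8):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, idx = nn.kneighbors(X)          # each of shape (n, k)
    return np.concatenate([dist, y_train[idx]], axis=1)  # shape (n, 2k)
```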
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science [16.384705926693073]
This study seeks to extend the power of pretraining methodologies to facilitate the prediction over tables in data science.
We introduce UniTabE, a method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures.
In order to implement the pretraining phase, we curated an expansive dataset comprising approximately 13B samples, meticulously gathered from the Kaggle platform.
arXiv Detail & Related papers (2023-07-18T13:28:31Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low-resource regimes, missing-value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
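The combination pattern is simple to sketch: synthesize extra rows, concatenate them with the real training data, and fit a backbone such as LightGBM. The `generate_synthetic` callable below is a hypothetical stand-in for a pretrained table generator like TapTap:

```python
# Sketch: augment real training data with synthetic rows, then fit a
# backbone model. The generator here is hypothetical, not TapTap's API.
import numpy as np
import lightgbm as lgb

def fit_with_augmentation(X, y, generate_synthetic, n_synth=1000):
    X_s, y_s = generate_synthetic(n_synth)   # hypothetical generator call
    X_aug = np.vstack([X, X_s])
    y_aug = np.concatenate([y, y_s])
    return lgb.LGBMClassifier().fit(X_aug, y_aug)
```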
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- XTab: Cross-table Pretraining for Tabular Transformers [29.419276738753968]
XTab is a framework for cross-table pretraining of tabular transformers on datasets from various domains.
We show that XTab consistently boosts the generalizability, learning speed, and performance of multiple tabular transformers.
XTab achieves superior performance to other state-of-the-art tabular deep learning models on various tasks such as regression, binary, and multiclass classification.
arXiv Detail & Related papers (2023-05-10T12:17:52Z)
- STUNT: Few-shot Tabular Learning with Self-generated Tasks from Unlabeled Tables [64.0903766169603]
We propose a framework for few-shot semi-supervised learning, coined Self-generated Tasks from UNlabeled Tables (STUNT).
Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label.
We then employ a meta-learning scheme to learn generalizable knowledge with the constructed tasks.
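A rough sketch of the task-generation step, assuming a numeric feature matrix; STUNT's full recipe (column subsets, the meta-learning loop) is more involved:

```python
# Sketch: self-generate a few-shot task from an unlabeled table by picking
# a random column, binning it into pseudo-classes, and sampling a support set.
import numpy as np

def sample_pseudo_task(X, n_classes=2, n_shot=5, rng=None):
    rng = rng or np.random.default_rng()
    col = rng.integers(X.shape[1])
    # Quantile-bin the chosen column's values into pseudo-labels.
    edges = np.quantile(X[:, col], np.linspace(0, 1, n_classes + 1)[1:-1])
    pseudo_y = np.digitize(X[:, col], edges)
    X_task = np.delete(X, col, axis=1)    # hide the label-defining column
    support = [rng.choice(np.where(pseudo_y == c)[0], n_shot, replace=False)
               for c in range(n_classes)]
    idx = np.concatenate(support)
    return X_task[idx], pseudo_y[idx]
```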
arXiv Detail & Related papers (2023-03-02T02:37:54Z)
- DIWIFT: Discovering Instance-wise Influential Features for Tabular Data [29.69737486124891]
Tabular data is one of the most common data storage formats in business applications, from retail and banking to e-commerce.
One of the critical problems in learning from tabular data is distinguishing influential features from all the predetermined features.
We propose a novel method for discovering instance-wise influential features for tabular data (DIWIFT).
Our method minimizes the validation loss and is thus more robust to distribution shift between the training and test datasets.
arXiv Detail & Related papers (2022-07-06T16:07:46Z)