Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces
- URL: http://arxiv.org/abs/2311.10051v1
- Date: Thu, 16 Nov 2023 17:45:59 GMT
- Title: Tabular Few-Shot Generalization Across Heterogeneous Feature Spaces
- Authors: Max Zhu, Katarzyna Kobalczyk, Andrija Petrovic, Mladen Nikolic,
Mihaela van der Schaar, Boris Delibasic, Petro Lio
- Abstract summary: We propose a novel approach to few-shot learning involving knowledge sharing between datasets with heterogeneous feature spaces.
FLAT learns low-dimensional embeddings of datasets and their individual columns, which facilitate knowledge transfer and generalization to previously unseen datasets.
A decoder network parametrizes the predictive target network, implemented as a Graph Attention Network, to accommodate the heterogeneous nature of tabular datasets.
- Score: 43.67453625260335
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the prevalence of tabular datasets, few-shot learning remains
under-explored within this domain. Existing few-shot methods are not directly
applicable to tabular datasets due to varying column relationships, meanings,
and permutational invariance. To address these challenges, we propose FLAT, a
novel approach to tabular few-shot learning, encompassing knowledge sharing
between datasets with heterogeneous feature spaces. Utilizing an encoder
inspired by Dataset2Vec, FLAT learns low-dimensional embeddings of datasets and
their individual columns, which facilitate knowledge transfer and
generalization to previously unseen datasets. A decoder network parametrizes
the predictive target network, implemented as a Graph Attention Network, to
accommodate the heterogeneous nature of tabular datasets. Experiments on a
diverse collection of 118 UCI datasets demonstrate FLAT's successful
generalization to new tabular datasets and a considerable improvement over the
baselines.
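The abstract's idea of embedding a dataset and its individual columns regardless of column order or count can be illustrated with a minimal sketch. This is not the authors' FLAT encoder: it is a generic set-pooling encoder loosely in the spirit of Dataset2Vec, and the names `init_params` and `embed`, the layer sizes, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def init_params(rng, hidden=8, out=4):
    # Hypothetical random weights standing in for trained encoder layers.
    return {
        "W_cell": rng.normal(size=(2, hidden)),   # maps (feature value, label) pairs
        "W_col": rng.normal(size=(hidden, out)),  # maps pooled per-column features
        "W_data": rng.normal(size=(out, out)),    # maps pooled dataset features
    }

def embed(X, y, params):
    """Return (column embeddings, dataset embedding) for a small labelled set.

    X: (n_rows, n_cols) features, y: (n_rows,) labels.
    """
    # Pair every cell with its row's label: shape (n_rows, n_cols, 2).
    pairs = np.stack([X, np.broadcast_to(y[:, None], X.shape)], axis=-1)
    h = np.tanh(pairs @ params["W_cell"])                 # (n_rows, n_cols, hidden)
    # Mean over rows: each column embedding ignores row order.
    col_emb = np.tanh(h.mean(axis=0) @ params["W_col"])   # (n_cols, out)
    # Mean over columns: the dataset embedding ignores column order and count.
    data_emb = np.tanh(col_emb.mean(axis=0) @ params["W_data"])  # (out,)
    return col_emb, data_emb

rng = np.random.default_rng(0)
params = init_params(rng)

# A toy 5-row, 3-column support set with binary labels.
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
cols, data = embed(X, y, params)

# Shuffling the columns permutes the column embeddings and leaves the
# dataset embedding unchanged (up to floating-point error).
perm = np.array([2, 0, 1])
cols_p, data_p = embed(X[:, perm], y, params)

# The same weights also accept a dataset with a different number of columns.
_, data_wide = embed(rng.normal(size=(5, 7)), y, params)
```

Because the pooling steps are means over rows and over columns, the same parameters apply to tables of any width, which is the property that makes sharing knowledge across heterogeneous feature spaces possible in principle.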
Related papers
- Representation Learning for Tabular Data: A Comprehensive Survey [23.606506938919605]
Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications.
Deep Neural Networks (DNNs) have recently demonstrated promising results through their capability of representation learning.
We organize existing methods into three main categories according to their generalization capabilities.
arXiv Detail & Related papers (2025-04-17T17:58:23Z)
- A Closer Look at TabPFN v2: Strength, Limitation, and Extension [51.08999772842298]
Tabular Prior-data Fitted Network v2 (TabPFN v2) achieves unprecedented in-context learning accuracy across multiple datasets.
In this paper, we evaluate TabPFN v2 on over 300 datasets, confirming its exceptional generalization capabilities on small- to medium-scale tasks.
arXiv Detail & Related papers (2025-02-24T17:38:42Z)
- Geodesic Flow Kernels for Semi-Supervised Learning on Mixed-Variable Tabular Dataset [31.23513370504603]
GFTab is a framework for semi-supervised learning on mixed-variable tabular datasets.
GFTab incorporates three key innovations: 1) Variable-specific corruption methods tailored to the distinct properties of continuous and categorical variables, 2) A Geodesic flow kernel based similarity measure to capture geometric changes between corrupted inputs, and 3) Tree-based embedding to leverage hierarchical relationships from available labeled data.
Our experimental results show that GFTab outperforms existing ML/DL models across many of these datasets, particularly in settings with limited labeled data.
arXiv Detail & Related papers (2024-12-17T12:47:53Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Cross-Table Pretraining towards a Universal Function Space for Heterogeneous Tabular Data [35.61663559675556]
Cross-dataset pretraining has shown notable success in various fields.
In this study, we introduce a cross-table pretrained Transformer, XTFormer, for versatile downstream tabular prediction tasks.
Our methodology is pretraining XTFormer to establish a "meta-function" space that encompasses all potential feature-target mappings.
arXiv Detail & Related papers (2024-06-01T03:24:31Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- TablEye: Seeing small Tables through the Lens of Images [1.4398570436349933]
We propose an innovative framework called TablEye, which aims to overcome the limit of forming prior knowledge for tabular data by adopting domain transformation.
This approach harnesses rigorously tested few-shot learning algorithms and embedding functions to acquire and apply prior knowledge.
TablEye demonstrated superior performance, outperforming TabLLM in a 4-shot task by up to 0.11 AUC and STUNT in a 1-shot setting, where it led by 3.17% accuracy on average.
arXiv Detail & Related papers (2023-07-04T02:45:59Z)
- Learning Representations without Compositional Assumptions [79.12273403390311]
We propose a data-driven approach that learns feature set dependencies by representing feature sets as graph nodes and their relationships as learnable edges.
We also introduce LEGATO, a novel hierarchical graph autoencoder that learns a smaller, latent graph to aggregate information from multiple views dynamically.
arXiv Detail & Related papers (2023-05-31T10:36:10Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from the subset of its features rather than its corrupted version in an autoencoder setting can better capture its underlying representation.
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.