PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization
- URL: http://arxiv.org/abs/2410.13516v1
- Date: Thu, 17 Oct 2024 13:05:44 GMT
- Title: PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization
- Authors: Marco Spinaci, Marek Polewczyk, Johannes Hoffart, Markus C. Kohler, Sam Thelin, Tassilo Klein
- Abstract summary: We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing.
It can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks.
- Score: 7.036380633387952
- Abstract: Self-supervised learning on tabular data seeks to apply advances from natural language and image domains to the diverse domain of tables. However, current techniques often struggle with integrating multi-domain data and require data cleaning or specific structural requirements, limiting the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.
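The content-specific, one-row-at-a-time tokenization can be pictured with a small sketch. The routing rules and token formats below are illustrative assumptions, not the paper's actual implementation; they only show how mixed-modality cells might be serialized without any dataset-level cleaning.

```python
# A minimal sketch of content-specific, row-at-a-time tokenization in the
# spirit of PORTAL. All names and routing rules are illustrative assumptions.
from datetime import datetime
from typing import Any

def tokenize_cell(column: str, value: Any) -> list[str]:
    """Route each cell to a content-specific tokenizer based on its type."""
    if value is None:
        return [f"{column}=<missing>"]  # no cleaning: missing values become a token
    if isinstance(value, bool):
        return [f"{column}=<bool:{value}>"]
    if isinstance(value, (int, float)):
        return [f"{column}=<num:{float(value)}>"]  # numbers as scalars, not digit strings
    if isinstance(value, datetime):
        return [f"{column}=<date:{value.date().isoformat()}>"]
    return [f"{column}={tok}" for tok in str(value).split()]  # fallback: free text

def tokenize_row(row: dict[str, Any]) -> list[str]:
    """Serialize one row independently of all others, so no cross-row
    schema alignment or dataset-level preprocessing is needed."""
    tokens = ["<row>"]
    for column, value in row.items():
        tokens.extend(tokenize_cell(column, value))
    tokens.append("</row>")
    return tokens

print(tokenize_row({"age": 42, "city": "San Francisco", "income": None}))
```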
Related papers
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve the performance and generalization of tabular foundation models.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach [36.47860223750303]
We consider the problem of automatic curation of high-quality datasets for self-supervised pre-training.
We propose a clustering-based approach for building such datasets.
Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository.
arXiv Detail & Related papers (2024-05-24T14:58:51Z)
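As a concrete illustration of the successive, hierarchical applications of k-means described above, here is a minimal sketch; the cluster counts, the synthetic embeddings, and the uniform per-group sampling are assumptions for demonstration, not the paper's exact pipeline.

```python
# A minimal sketch of hierarchical k-means curation; sizes are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))  # stand-in for a large, diverse repository

# Level 1: fine-grained k-means over all points.
level1 = KMeans(n_clusters=100, n_init="auto", random_state=0).fit(embeddings)

# Level 2: k-means over the level-1 centroids, forming coarse concept groups.
level2 = KMeans(n_clusters=10, n_init="auto", random_state=0).fit(level1.cluster_centers_)
coarse_of_point = level2.labels_[level1.labels_]  # coarse group of every point

# Curate by sampling evenly across coarse groups, rebalancing the repository.
curated = []
for g in range(10):
    idx = np.flatnonzero(coarse_of_point == g)
    curated.extend(rng.choice(idx, size=min(50, idx.size), replace=False))
print(len(curated), "points selected for the curated subset")
```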
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation learning approach for data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
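To make the graph view of linked tables concrete, below is a minimal sketch of one message-passing step over two tables joined by a foreign key; the toy schema, features, and mean aggregation are assumptions for illustration, not the paper's architecture.

```python
# A minimal sketch of message passing over rows of linked tables.
import numpy as np

# Toy relational schema: each order row links to a customer row via a foreign key.
customer_feats = np.array([[1.0, 0.0], [0.0, 1.0]])           # 2 customers
order_feats = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])  # 3 orders
order_to_customer = np.array([0, 0, 1])                       # foreign-key column

def aggregate(messages, fk, n_targets):
    """One message-passing step: mean-pool each customer's linked orders."""
    out = np.zeros((n_targets, messages.shape[1]))
    counts = np.zeros(n_targets)
    for i, target in enumerate(fk):
        out[target] += messages[i]
        counts[target] += 1
    return out / np.maximum(counts[:, None], 1)

# Customers update their representation with information from their orders,
# without any manual flattening or joining of the two tables.
customer_repr = customer_feats + aggregate(order_feats, order_to_customer, 2)
print(customer_repr)
```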
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
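One plausible reading of the meta-representation idea, sketched below, describes each instance by its distances to the nearest training points of every class, yielding a fixed-size vector regardless of the original feature count; the distance-based featurization and the choice k=3 are assumptions, not necessarily TabPTM's exact construction.

```python
# A minimal sketch of a distance-based meta-representation (an assumption).
import numpy as np

def meta_representation(x, train_X, train_y, k=3):
    """Describe instance x by its distances to the k nearest training points
    of each class, giving a fixed-size vector independent of feature count."""
    feats = []
    for c in np.unique(train_y):
        dists = np.linalg.norm(train_X[train_y == c] - x, axis=1)
        feats.extend(np.sort(dists)[:k])
    return np.array(feats)  # shape: (n_classes * k,)

rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = rng.integers(0, 2, size=100)
print(meta_representation(X[0], X[1:], y[1:]).shape)  # (6,) for 2 classes
```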
- CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks through offline training on large-scale datasets.
To support the study of VLCP, we first contribute a comprehensive and unified benchmark dataset, P9D.
Treating the data from each industry as an independent task supports continual learning and conforms to the real-world long-tail distribution, simulating pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z)
- Towards Cross-Table Masked Pretraining for Web Data Mining [22.952238405240188]
We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
arXiv Detail & Related papers (2023-07-10T02:27:38Z)
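A common instantiation of cross-table masked pretraining is masked-cell reconstruction, sketched below on a single numeric table; the 15% masking rate, zero-value corruption, and squared-error loss are assumptions for illustration, not CM2's actual objective.

```python
# A minimal sketch of masked-cell pretraining on one toy table.
import numpy as np

rng = np.random.default_rng(0)
cells = rng.normal(size=(4, 5))          # one toy table: 4 rows x 5 numeric cells
mask = rng.random(cells.shape) < 0.15    # mask roughly 15% of cells
mask[0, 0] = True                        # ensure at least one masked cell for the demo
corrupted = np.where(mask, 0.0, cells)   # replace masked cells with a constant

# A real model would encode `corrupted` and learn to reconstruct the original
# values at masked positions; a stand-in prediction shows the loss shape.
predictions = corrupted
loss = np.mean((predictions[mask] - cells[mask]) ** 2)
print(f"masked-cell reconstruction loss: {loss:.3f}")
```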
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
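The inverse dynamics objective mentioned above can be sketched in a few lines: from a pair of consecutive observations, predict which action was taken between them. The synthetic observations and the linear prediction head below are illustrative assumptions.

```python
# A minimal sketch of an inverse dynamics prediction objective.
import numpy as np

rng = np.random.default_rng(0)
obs_t = rng.normal(size=(32, 16))       # batch of observations at time t
obs_t1 = rng.normal(size=(32, 16))      # observations at time t+1
actions = rng.integers(0, 4, size=32)   # discrete actions taken between them

W = rng.normal(size=(32, 4)) * 0.01     # linear head over the concatenated pair
logits = np.concatenate([obs_t, obs_t1], axis=1) @ W

# Cross-entropy on "which action was taken"; minimizing it forces the
# representation to retain action-relevant information.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(32), actions].mean()
print(f"inverse dynamics loss: {loss:.3f}")
```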
- Embeddings for Tabular Data: A Survey [8.010589283146222]
Tabular data comprises rows (samples) with the same set of columns (attributes).
Tables are becoming the natural way of storing data across industry and academia.
A new line of research applies machine learning techniques to support various database tasks.
arXiv Detail & Related papers (2023-02-23T04:37:49Z)
- Understanding the World Through Action [91.3755431537592]
I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning.
I will discuss how such a procedure is more closely aligned with potential downstream tasks.
arXiv Detail & Related papers (2021-10-24T22:33:52Z)
- TURL: Table Understanding through Representation Learning [29.6016859927782]
TURL is a novel framework that introduces the pre-training/fine-tuning paradigm to relational Web tables.
During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner.
We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.
arXiv Detail & Related papers (2020-06-26T05:44:54Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)