Scaling Experiments in Self-Supervised Cross-Table Representation Learning
- URL: http://arxiv.org/abs/2309.17339v1
- Date: Fri, 29 Sep 2023 15:48:38 GMT
- Title: Scaling Experiments in Self-Supervised Cross-Table Representation Learning
- Authors: Maximilian Schambach, Dominique Paul, Johannes S. Otterbach
- Abstract summary: We introduce a novel Transformer-based architecture specifically tailored to tabular data and cross-table representation learning.
Our training approach encompasses both single-table and cross-table models, trained via missing value imputation.
To understand the scaling behavior of our method, we train models of varying sizes, ranging from approximately $10^4$ to $10^7$ parameters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To analyze the scaling potential of deep tabular representation learning
models, we introduce a novel Transformer-based architecture specifically
tailored to tabular data and cross-table representation learning by utilizing
table-specific tokenizers and a shared Transformer backbone. Our training
approach encompasses both single-table and cross-table models, trained via
missing value imputation through a self-supervised masked cell recovery
objective. To understand the scaling behavior of our method, we train models of
varying sizes, ranging from approximately $10^4$ to $10^7$ parameters. These
models are trained on a carefully curated pretraining dataset, consisting of
135M training tokens sourced from 76 diverse datasets. We assess the scaling of
our architecture in both single-table and cross-table pretraining setups by
evaluating the pretrained models using linear probing on a curated set of
benchmark datasets and comparing the results with conventional baselines.
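The masked cell recovery objective mentioned in the abstract can be illustrated with a minimal sketch of the data side only: hide random cells of a table and keep their original values as targets. This is not the authors' implementation. The `MASK` placeholder, the masking probability, the helper names `mask_cells` and `recovery_accuracy`, and the exact-match scoring are all assumptions made for illustration; the paper's tokenizers and Transformer backbone are not shown.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_cells(rows, mask_prob=0.15, seed=0):
    """Randomly hide cells; return the corrupted rows and the recovery targets.

    targets maps (row_index, column_index) -> original value, so that a loss
    or metric can be computed only on the cells that were hidden.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if rng.random() < mask_prob:
                targets[(i, j)] = value
                new_row[j] = MASK
        corrupted.append(new_row)
    return corrupted, targets


def recovery_accuracy(predictions, targets):
    """Fraction of masked cells whose predicted value equals the original.

    Exact match is an illustrative simplification; numerical cells would
    normally be scored with a regression-style loss instead.
    """
    if not targets:
        return 0.0
    hits = sum(1 for key, value in targets.items() if predictions.get(key) == value)
    return hits / len(targets)


# Toy table with mixed numerical and categorical columns (made-up values).
table = [
    [34, "engineer", 72000.0],
    [51, "teacher", 48000.0],
    [27, "nurse", 53000.0],
]
corrupted, targets = mask_cells(table, mask_prob=0.5, seed=0)
```

In a real training loop, a model would receive `corrupted` as input and be optimized to predict the values stored in `targets`; `recovery_accuracy` only stands in for that evaluation step.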
Related papers
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
(2024-04-22T09:16:14Z)
- TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications [9.457938949410583]
TabRepo is a new dataset of model evaluations and predictions.
It contains the predictions and metrics of 1310 models evaluated on 200 datasets.
(2023-11-06T09:17:18Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
(2023-10-31T18:03:54Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
(2023-09-20T10:31:17Z)
- Retrieval-Based Transformer for Table Augmentation [14.460363647772745]
We introduce a novel approach toward automatic data wrangling.
We aim to address table augmentation tasks, including row/column population and data imputation.
Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
(2023-06-20T18:51:21Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
(2023-03-24T17:56:22Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
(2022-09-30T02:25:12Z)
- MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
(2021-12-27T16:16:35Z)
- Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is a performant source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
(2020-10-14T07:59:00Z)
- Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models.
The weights are determined in a data-driven way and depend on the similarity between the corresponding functions.
We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
(2019-12-30T09:50:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.