Towards Cross-Table Masked Pretraining for Web Data Mining
- URL: http://arxiv.org/abs/2307.04308v2
- Date: Thu, 1 Feb 2024 14:54:00 GMT
- Title: Towards Cross-Table Masked Pretraining for Web Data Mining
- Authors: Chao Ye, Guoshan Lu, Haobo Wang, Liyao Li, Sai Wu, Gang Chen, Junbo
Zhao
- Abstract summary: We propose an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2.
Our experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
- Score: 22.952238405240188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data pervades the landscape of the World Wide Web, playing a
foundational role in the digital architecture that underpins online
information. Given the recent influence of large-scale pretrained models like
ChatGPT and SAM across various domains, exploring the application of
pretraining techniques for mining tabular data on the web has emerged as a
highly promising research direction. Indeed, there have been some recent
works on this topic, but most (if not all) are limited to the
fixed-schema, single-table setting. Given the limited dataset scale and
parameter size of prior models, we believe the "BERT moment" for
ubiquitous tabular data has not yet arrived, and development along this
line significantly lags behind counterpart research domains such as
natural language processing. In this work, we first identify the crucial challenges
behind tabular data pretraining, particularly overcoming the cross-table
hurdle. As a pioneering endeavor, this work mainly (i) contributes a
high-quality real-world tabular dataset, (ii) proposes an innovative,
generic, and efficient cross-table pretraining framework, dubbed CM2,
whose core is a semantic-aware tabular neural network that uniformly
encodes heterogeneous tables without much restriction, and (iii)
introduces a novel pretraining objective, prompt Masked Table Modeling
(pMTM), inspired by NLP but intricately tailored to scalable pretraining
on tables. Our extensive
experiments demonstrate CM2's state-of-the-art performance and validate that
cross-table pretraining can enhance various downstream tasks.
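The abstract names the prompt Masked Table Modeling (pMTM) objective without spelling out its mechanics. Below is a minimal, hypothetical Python sketch of the general idea only: mask a fraction of cell values in a serialized row and keep each column name as the prompt for reconstructing the hidden value. The serialization format, the 15% mask ratio, and the function names are illustrative assumptions, not CM2's actual implementation.

```python
import random

MASK_TOKEN = "[MASK]"

def make_pmtm_example(row, mask_ratio=0.15, seed=None):
    """Turn one table row {column: value} into (masked serialization, targets).

    Hypothetical helper: each cell is rendered as "column is value"; masked
    cells keep the column name as the prompt and hide the value behind [MASK].
    """
    rng = random.Random(seed)
    columns = list(row)
    n_masked = max(1, int(len(columns) * mask_ratio))
    masked_cols = set(rng.sample(columns, n_masked))

    inputs, targets = [], {}
    for col in columns:
        if col in masked_cols:
            inputs.append(f"{col} is {MASK_TOKEN}")
            targets[col] = row[col]  # value the model must reconstruct
        else:
            inputs.append(f"{col} is {row[col]}")
    return " ; ".join(inputs), targets

if __name__ == "__main__":
    row = {"age": 42, "occupation": "teacher", "income": 52000}
    text, targets = make_pmtm_example(row, seed=0)
    print(text)     # one cell value replaced by [MASK]; its column name remains as the prompt
    print(targets)  # maps the masked column name to the value to be reconstructed
```

In a full pretraining pipeline, the masked serialization would be fed to a tabular encoder and the targets recovered by per-cell prediction heads; only the data-side masking step is sketched here.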
Related papers
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization [7.036380633387952]
We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing.
It can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks.
arXiv Detail & Related papers (2024-10-17T13:05:44Z)
- Transformers with Stochastic Competition for Tabular Data Modelling [6.285325771390289]
We introduce a novel deep learning model specifically designed for tabular data.
The model is validated on a variety of widely-used, publicly available datasets.
We demonstrate that incorporating these stochastic competition elements yields high performance.
arXiv Detail & Related papers (2024-07-18T07:48:48Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has made significant progress in image and language processing.
We present TP-BERTa, an LM specifically pre-trained for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names (a simplified sketch of this numeric-tokenization idea appears after the related-papers list below).
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science [16.384705926693073]
This study seeks to extend the power of pretraining methodologies to facilitate prediction over tables in data science.
We introduce UniTabE, a method designed to process tables in a uniform manner, devoid of constraints imposed by specific table structures.
In order to implement the pretraining phase, we curated an expansive dataset comprising approximately 13B samples, meticulously gathered from the Kaggle platform.
arXiv Detail & Related papers (2023-07-18T13:28:31Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low resource regime, missing value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP) and Transformer.
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- Embeddings for Tabular Data: A Survey [8.010589283146222]
Tabular data comprises rows (samples) with the same set of columns (attributes).
Tables are becoming the natural way of storing data across industries and academia.
A new line of research applies various learning techniques to support database tasks.
arXiv Detail & Related papers (2023-02-23T04:37:49Z)
- Transfer Learning with Deep Tabular Models [66.67017691983182]
We show that upstream data gives tabular neural networks a decisive advantage over GBDT models.
We propose a realistic medical diagnosis benchmark for tabular transfer learning.
We propose a pseudo-feature method for cases where the upstream and downstream feature sets differ.
arXiv Detail & Related papers (2022-06-30T14:24:32Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
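The TP-BERTa entry above mentions converting scalar numerical feature values into discrete, feature-name-aware tokens. The sketch below illustrates only that general "scalar to discrete magnitude token" idea using plain empirical quantile bins; the binning rule, token format, and function names are assumptions for illustration and do not reproduce TP-BERTa's actual relative magnitude tokenization or its intra-feature attention.

```python
import bisect

def fit_quantile_bins(values, n_bins=8):
    """Estimate bin edges from training values via rough empirical quantiles
    (an illustrative stand-in, not TP-BERTa's actual binning rule)."""
    ordered = sorted(values)
    return [ordered[int(len(ordered) * q / n_bins)] for q in range(1, n_bins)]

def tokenize_numeric(feature_name, value, edges):
    """Map one scalar to a discrete token such as 'income|bin_1', pairing the
    feature name with the bin index of the value."""
    bin_id = bisect.bisect_right(edges, value)
    return f"{feature_name}|bin_{bin_id}"

if __name__ == "__main__":
    incomes = [12_000, 25_000, 31_000, 48_000, 52_000, 75_000, 90_000, 120_000]
    edges = fit_quantile_bins(incomes, n_bins=4)        # rough 25/50/75% cut points
    print(tokenize_numeric("income", 50_000, edges))    # income|bin_1
    print(tokenize_numeric("income", 130_000, edges))   # income|bin_3 (above the top edge)
```

Pairing the feature name with a coarse magnitude bin is one simple way to give numeric columns from heterogeneous tables a shared, discrete vocabulary.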