PTab: Using the Pre-trained Language Model for Modeling Tabular Data
- URL: http://arxiv.org/abs/2209.08060v1
- Date: Thu, 15 Sep 2022 08:58:42 GMT
- Title: PTab: Using the Pre-trained Language Model for Modeling Tabular Data
- Authors: Guang Liu and Jie Yang and Ledell Wu
- Abstract summary: Recent studies show that neural-based models are effective at learning contextual representations for tabular data.
We propose a novel framework PTab, using the Pre-trained language model to model Tabular data.
Our method has achieved a better average AUC score in supervised settings compared to the state-of-the-art baselines.
- Score: 5.791972449406902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is the foundation of the information age and has been
extensively studied. Recent studies show that neural-based models are effective at learning contextual representations for tabular data. Learning an
effective contextual representation requires meaningful features and a large
amount of data. However, current methods often fail to properly learn a contextual representation from features that lack semantic information. In addition, it is intractable to enlarge the training set by mixing tabular datasets, owing to the differences between datasets. To address these problems, we
propose a novel framework PTab, using the Pre-trained language model to model
Tabular data. PTab learns a contextual representation of tabular data through a
three-stage process: Modality Transformation (MT), Masked-Language Fine-tuning (MF), and Classification Fine-tuning (CF). We initialize our model with a pre-trained model (PTM) that contains semantic information learned from large-scale language data. Consequently, a contextual representation can be
learned effectively during the fine-tuning stages. In addition, we can
naturally mix the textualized tabular data to enlarge the training set to
further improve representation learning. We evaluate PTab on eight popular
tabular classification datasets. Experimental results show that our method has
achieved a better average AUC score in supervised settings compared to the
state-of-the-art baselines (e.g., XGBoost), and outperforms counterpart methods under semi-supervised settings. We also present visualization results showing that PTab has good instance-based interpretability.
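To make the three stages concrete, here is a minimal sketch assuming a HuggingFace BERT backbone and a toy "feature is value" textualization template; the paper prescribes neither, so treat every name and hyperparameter below as an assumption.

```python
# Hedged sketch of PTab's three stages; the bert-base-uncased backbone,
# the textualization template, and the toy schema are all assumptions.
import torch
from transformers import (AutoModelForMaskedLM,
                          AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Stage 1: Modality Transformation (MT) -- serialize a table row as text
# so feature names contribute their semantics to the representation.
def textualize(row: dict) -> str:
    return " ".join(f"{name} is {value}." for name, value in row.items())

text = textualize({"age": 39, "education": "Bachelors", "hours-per-week": 40})

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stage 2: Masked-Language Fine-tuning (MF) -- continue masked-language
# training on the textualized rows (one optimization step shown).
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
batch = collator([tok(text, truncation=True)])
mlm(**batch).loss.backward()

# Stage 3: Classification Fine-tuning (CF) -- attach a classification head
# (initialized from the MF checkpoint in the real pipeline) and train on labels.
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
inputs = tok(text, return_tensors="pt")
clf(**inputs, labels=torch.tensor([1])).loss.backward()
```

Because every row becomes plain text, rows from different datasets can simply be concatenated into one fine-tuning corpus, which is what makes the mixed-dataset training described above possible.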
Related papers
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting [23.461204546005387]
TabMDA is a novel method for manifold data augmentation on tabular data.
It exploits a pre-trained in-context model, such as TabPFN, to map the data into an embedding space.
We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various datasets.
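The augmentation loop is easy to sketch: embed each training point under several random context subsets of a pre-trained in-context model, keep all resulting embeddings as label-preserving copies, and train a standard classifier on the union. The encoder below is a deliberate stand-in; TabPFN's internal embedding interface is not assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a pre-trained in-context encoder such as TabPFN's transformer
# body: the real model attends over (context_X, context_y) and the query rows.
# This fake context-dependent map exists only to make the loop runnable.
def encode(context_X, context_y, query_X):
    shift = context_X[context_y == 1].mean(0) - context_X[context_y == 0].mean(0)
    return query_X + 0.1 * shift  # embedding varies with the context subset

X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

# In-context subsetting: each random context produces a different embedding
# of the same point, i.e. a label-preserving augmented copy.
aug_X, aug_y = [], []
for _ in range(5):
    idx = rng.choice(len(X), size=100, replace=False)
    aug_X.append(encode(X[idx], y[idx], X))
    aug_y.append(y)

# Any standard classifier is then trained in the augmented embedding space.
clf = LogisticRegression(max_iter=1000).fit(np.concatenate(aug_X),
                                            np.concatenate(aug_y))
```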
arXiv Detail & Related papers (2024-06-03T21:51:13Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has driven significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
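One way to read "relative magnitude tokenization" is as quantile binning of each scalar feature into a shared vocabulary of discrete magnitude tokens. The sketch below illustrates only that reading; the token format and bin count are invented here, not taken from TP-BERTa.

```python
import numpy as np

# Hypothetical magnitude tokenizer: map each scalar to one of n_bins
# quantile buckets and emit a discrete token the LM can embed. TP-BERTa's
# actual scheme is more involved; this only conveys the idea.
def magnitude_tokens(values: np.ndarray, n_bins: int = 8) -> list:
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)           # bucket index in 0..n_bins-1
    return [f"[MAG_{b}]" for b in bins]

incomes = np.array([23_000, 41_500, 250_000, 58_000])
print(magnitude_tokens(incomes))                # e.g. ['[MAG_0]', '[MAG_3]', ...]
```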
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks [31.82225213006849]
Tabular classification has traditionally relied on supervised algorithms, which estimate the parameters of a prediction model using its training data.
Recently, Prior-Data Fitted Networks (PFNs) such as TabPFN have successfully learned to classify tabular data in-context.
While such models show great promise, their applicability to real-world data remains limited due to the computational scale needed.
arXiv Detail & Related papers (2023-11-17T16:04:27Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning [150.17907456113537]
We present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 grade-level problems that require mathematical reasoning.
We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting.
We propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data.
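The selection mechanism can be sketched as a REINFORCE update over a small candidate pool. Everything below is a stand-in: the reward simulates "build a few-shot prompt, query GPT-3, check the answer", and all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, lr = 20, 0.1
theta = np.zeros(n_candidates)       # one selection logit per training example

for step in range(100):
    p = np.exp(theta) / np.exp(theta).sum()
    idx = rng.choice(n_candidates, size=2, p=p)  # pick two in-context examples
    # Stand-in reward: the real method prompts GPT-3 with the selected
    # examples and rewards a correct answer on the TabMWP problem.
    reward = float(rng.random() < 0.3 + 0.02 * idx.mean())
    grad = -2.0 * p                              # gradient of the summed log-probs
    for i in idx:
        grad[i] += 1.0
    theta += lr * reward * grad                  # REINFORCE update
```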
arXiv Detail & Related papers (2022-09-29T08:01:04Z)
- TabText: A Flexible and Contextual Approach to Tabular Data Representation [4.116980088382032]
TabText is a processing framework that extracts contextual information from tabular data structures.
We show that TabText improves the average and worst-case AUC performance of standard machine learning models by as much as 6%.
arXiv Detail & Related papers (2022-06-21T13:28:57Z)
- SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from a subset of its features, rather than from a corrupted version of it, in an autoencoder setting can better capture its underlying representation.
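A toy version of that objective, with all dimensions assumed for illustration: the autoencoder sees only a random subset of columns yet is penalized for failing to reconstruct the full row.

```python
import torch
import torch.nn as nn

# Reconstruct-from-subset objective in the spirit of SubTab; the sizes
# and the random-subset policy here are illustrative assumptions.
n_features, subset_size, hidden = 20, 5, 32
encoder = nn.Sequential(nn.Linear(subset_size, hidden), nn.ReLU())
decoder = nn.Linear(hidden, n_features)          # predicts *all* features

x = torch.randn(64, n_features)                  # a batch of table rows
cols = torch.randperm(n_features)[:subset_size]  # random feature subset
recon = decoder(encoder(x[:, cols]))
loss = nn.functional.mse_loss(recon, x)          # full row from partial view
loss.backward()
```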
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
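A synchronous context-free grammar pairs each natural-language expansion with an aligned SQL expansion, so sampling from the grammar yields (question, SQL) training pairs. The one-rule toy grammar below is invented for illustration and far smaller than GraPPa's.

```python
import random

random.seed(0)
COLUMNS = ["age", "salary"]                       # invented table schema
OPS = [("greater than", ">"), ("less than", "<")]

# One synchronized rule: the question side and the SQL side expand
# together, so the sampled pair is aligned by construction.
def sample_pair():
    col = random.choice(COLUMNS)
    op_text, op_sql = random.choice(OPS)
    val = random.randint(1, 100)
    question = f"show all rows where {col} is {op_text} {val}"
    sql = f"SELECT * FROM t WHERE {col} {op_sql} {val}"
    return question, sql

print(sample_pair())
```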
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.