PTab: Using the Pre-trained Language Model for Modeling Tabular Data
- URL: http://arxiv.org/abs/2209.08060v1
- Date: Thu, 15 Sep 2022 08:58:42 GMT
- Title: PTab: Using the Pre-trained Language Model for Modeling Tabular Data
- Authors: Guang Liu and Jie Yang and Ledell Wu
- Abstract summary: Recent studies show that neural-based models are effective at learning contextual representations for tabular data.
We propose a novel framework PTab, using the Pre-trained language model to model Tabular data.
Our method has achieved a better average AUC score in supervised settings compared to the state-of-the-art baselines.
- Score: 5.791972449406902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tabular data is the foundation of the information age and has been
extensively studied. Recent studies show that neural-based models are effective at learning contextual representations for tabular data. Learning an
effective contextual representation requires meaningful features and a large
amount of data. However, current methods often fail to properly learn a contextual representation from features that lack semantic information. In addition, it is intractable to enlarge the training set by mixing tabular datasets, owing to the differences between datasets. To address these problems, we
propose a novel framework PTab, using the Pre-trained language model to model
Tabular data. PTab learns a contextual representation of tabular data through a
three-stage process: Modality Transformation (MT), Masked-Language Fine-tuning (MF), and Classification Fine-tuning (CF). We initialize our model with a pre-trained model (PTM) that contains semantic information learned from large-scale language data. Consequently, a contextual representation can be
learned effectively during the fine-tuning stages. In addition, we can
naturally mix the textualized tabular data to enlarge the training set to
further improve representation learning. We evaluate PTab on eight popular
tabular classification datasets. Experimental results show that our method has
achieved a better average AUC score in supervised settings compared to the
state-of-the-art baselines (e.g., XGBoost), and outperforms counterpart methods under semi-supervised settings. We also present visualization results showing that PTab has good instance-based interpretability.
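To make the three stages concrete, here is a minimal sketch assuming a HuggingFace BERT backbone and a toy "feature is value" textualization template; the paper prescribes neither, so treat every name and hyperparameter below as an assumption.

```python
# Hedged sketch of PTab's three stages; the bert-base-uncased backbone,
# the textualization template, and the toy schema are all assumptions.
import torch
from transformers import (AutoModelForMaskedLM,
                          AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Stage 1: Modality Transformation (MT) -- serialize a table row as text
# so feature names contribute their semantics to the representation.
def textualize(row: dict) -> str:
    return " ".join(f"{name} is {value}." for name, value in row.items())

text = textualize({"age": 39, "education": "Bachelors", "hours-per-week": 40})

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stage 2: Masked-Language Fine-tuning (MF) -- continue masked-language
# training on the textualized rows (one optimization step shown).
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
batch = collator([tok(text, truncation=True)])
mlm(**batch).loss.backward()

# Stage 3: Classification Fine-tuning (CF) -- attach a classification head
# (initialized from the MF checkpoint in the real pipeline) and train on labels.
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
inputs = tok(text, return_tensors="pt")
clf(**inputs, labels=torch.tensor([1])).loss.backward()
```

Because every row becomes plain text, rows from different datasets can simply be concatenated into one fine-tuning corpus, which is what makes the mixed-dataset training described above possible.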
Related papers
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting [23.461204546005387]
TabMDA is a novel method for manifold data augmentation on tabular data.
It exploits a pre-trained in-context model, such as TabPFN, to map the data into an embedding space.
We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various datasets.
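The augmentation loop is easy to sketch: embed each training point under several random context subsets of a pre-trained in-context model, keep all resulting embeddings as label-preserving copies, and train a standard classifier on the union. The encoder below is a deliberate stand-in; TabPFN's internal embedding interface is not assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a pre-trained in-context encoder such as TabPFN's transformer
# body: the real model attends over (context_X, context_y) and the query rows.
# This fake context-dependent map exists only to make the loop runnable.
def encode(context_X, context_y, query_X):
    shift = context_X[context_y == 1].mean(0) - context_X[context_y == 0].mean(0)
    return query_X + 0.1 * shift  # embedding varies with the context subset

X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

# In-context subsetting: each random context produces a different embedding
# of the same point, i.e. a label-preserving augmented copy.
aug_X, aug_y = [], []
for _ in range(5):
    idx = rng.choice(len(X), size=100, replace=False)
    aug_X.append(encode(X[idx], y[idx], X))
    aug_y.append(y)

# Any standard classifier is then trained in the augmented embedding space.
clf = LogisticRegression(max_iter=1000).fit(np.concatenate(aug_X),
                                            np.concatenate(aug_y))
```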
arXiv Detail & Related papers (2024-06-03T21:51:13Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has driven significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
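One way to read "relative magnitude tokenization" is as quantile binning of each scalar feature into a shared vocabulary of discrete magnitude tokens. The sketch below illustrates only that reading; the token format and bin count are invented here, not taken from TP-BERTa.

```python
import numpy as np

# Hypothetical magnitude tokenizer: map each scalar to one of n_bins
# quantile buckets and emit a discrete token the LM can embed. TP-BERTa's
# actual scheme is more involved; this only conveys the idea.
def magnitude_tokens(values: np.ndarray, n_bins: int = 8) -> list:
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)           # bucket index in 0..n_bins-1
    return [f"[MAG_{b}]" for b in bins]

incomes = np.array([23_000, 41_500, 250_000, 58_000])
print(magnitude_tokens(incomes))                # e.g. ['[MAG_0]', '[MAG_3]', ...]
```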
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks [31.82225213006849]
Tabular classification has traditionally relied on supervised algorithms, which estimate the parameters of a prediction model using its training data.
Recently, Prior-Data Fitted Networks (PFNs) such as TabPFN have successfully learned to classify tabular data in-context.
While such models show great promise, their applicability to real-world data remains limited due to the computational scale needed.
arXiv Detail & Related papers (2023-11-17T16:04:27Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning [150.17907456113537]
We present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 grade-level problems that require mathematical reasoning.
We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting.
We propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data.
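The selection mechanism can be sketched as a REINFORCE update over a small candidate pool. Everything below is a stand-in: the reward simulates "build a few-shot prompt, query GPT-3, check the answer", and all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, lr = 20, 0.1
theta = np.zeros(n_candidates)       # one selection logit per training example

for step in range(100):
    p = np.exp(theta) / np.exp(theta).sum()
    idx = rng.choice(n_candidates, size=2, p=p)  # pick two in-context examples
    # Stand-in reward: the real method prompts GPT-3 with the selected
    # examples and rewards a correct answer on the TabMWP problem.
    reward = float(rng.random() < 0.3 + 0.02 * idx.mean())
    grad = -2.0 * p                              # gradient of the summed log-probs
    for i in idx:
        grad[i] += 1.0
    theta += lr * reward * grad                  # REINFORCE update
```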
arXiv Detail & Related papers (2022-09-29T08:01:04Z)
- TabText: A Flexible and Contextual Approach to Tabular Data Representation [4.116980088382032]
TabText is a processing framework that extracts contextual information from tabular data structures.
We show that TabText improves the average and worst-case AUC performance of standard machine learning models by as much as 6%.
arXiv Detail & Related papers (2022-06-21T13:28:57Z)
- SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from a subset of its features, rather than from a corrupted version of it, in an autoencoder setting can better capture its underlying representation.
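A toy version of that objective, with all dimensions assumed for illustration: the autoencoder sees only a random subset of columns yet is penalized for failing to reconstruct the full row.

```python
import torch
import torch.nn as nn

# Reconstruct-from-subset objective in the spirit of SubTab; the sizes
# and the random-subset policy here are illustrative assumptions.
n_features, subset_size, hidden = 20, 5, 32
encoder = nn.Sequential(nn.Linear(subset_size, hidden), nn.ReLU())
decoder = nn.Linear(hidden, n_features)          # predicts *all* features

x = torch.randn(64, n_features)                  # a batch of table rows
cols = torch.randperm(n_features)[:subset_size]  # random feature subset
recon = decoder(encoder(x[:, cols]))
loss = nn.functional.mse_loss(recon, x)          # full row from partial view
loss.backward()
```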
arXiv Detail & Related papers (2021-10-08T20:11:09Z)
- GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
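A synchronous context-free grammar pairs each natural-language expansion with an aligned SQL expansion, so sampling from the grammar yields (question, SQL) training pairs. The one-rule toy grammar below is invented for illustration and far smaller than GraPPa's.

```python
import random

random.seed(0)
COLUMNS = ["age", "salary"]                       # invented table schema
OPS = [("greater than", ">"), ("less than", "<")]

# One synchronized rule: the question side and the SQL side expand
# together, so the sampled pair is aligned by construction.
def sample_pair():
    col = random.choice(COLUMNS)
    op_text, op_sql = random.choice(OPS)
    val = random.randint(1, 100)
    question = f"show all rows where {col} is {op_text} {val}"
    sql = f"SELECT * FROM t WHERE {col} {op_sql} {val}"
    return question, sql

print(sample_pair())
```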
arXiv Detail & Related papers (2020-09-29T08:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.