TabText: A Flexible and Contextual Approach to Tabular Data Representation
- URL: http://arxiv.org/abs/2206.10381v4
- Date: Fri, 21 Jul 2023 20:34:02 GMT
- Title: TabText: A Flexible and Contextual Approach to Tabular Data Representation
- Authors: Kimberly Villalobos Carballo, Liangyuan Na, Yu Ma, Léonard Boussioux, Cynthia Zeng, Luis R. Soenksen, Dimitris Bertsimas
- Abstract summary: TabText is a processing framework that extracts contextual information from tabular data structures.
We show that TabText improves the average and worst-case AUC performance of standard machine learning models by as much as 6%.
- Score: 4.116980088382032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tabular data is essential for applying machine learning across various
industries. However, traditional data processing methods do not fully utilize
all the information available in the tables, ignoring important contextual
information such as column header descriptions. In addition, pre-processing
data into a tabular format can remain a labor-intensive bottleneck in model
development. This work introduces TabText, a processing and feature extraction
framework that extracts contextual information from tabular data structures.
TabText addresses processing difficulties by converting the content into
language and utilizing pre-trained large language models (LLMs). We evaluate
our framework on nine healthcare prediction tasks, including patient
discharge, ICU admission, and mortality. We show that 1) applying our TabText
framework enables the generation of high-performing and simple machine learning
baseline models with minimal data pre-processing, and 2) augmenting
pre-processed tabular data with TabText representations improves the average
and worst-case AUC performance of standard machine learning models by as much
as 6%.
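The core mechanism described in the abstract, serializing a table row into natural language using its column headers as context and encoding the result with a pretrained language model, can be illustrated with a short sketch. The column names, example records, serialization template, encoder choice, and pooling step below are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a TabText-style pipeline: table row -> sentence -> embedding.
# The columns, example records, and encoder are illustrative assumptions only.
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer


def row_to_text(row: pd.Series) -> str:
    """Serialize one tabular record into a sentence, keeping column headers as context."""
    parts = [f"{col.replace('_', ' ')} is {val}" for col, val in row.items() if pd.notna(val)]
    return "; ".join(parts) + "."


# Toy patient table with hypothetical columns.
df = pd.DataFrame([
    {"age": 67, "sex": "female", "heart_rate": 92, "admission_type": "emergency"},
    {"age": 45, "sex": "male", "heart_rate": 78, "admission_type": "elective"},
])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

texts = [row_to_text(row) for _, row in df.iterrows()]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state          # (rows, tokens, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)
    # Mean-pool token embeddings into one fixed-size vector per row; these vectors
    # can then augment (or replace) standard tabular features in a downstream model.
    row_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(row_embeddings.shape)  # (2, 768) for bert-base-uncased
```

In the paper's framing, such language-derived representations are used either as a low-effort baseline input or as an augmentation of the pre-processed tabular features before fitting standard models; the mean pooling above is one common convention rather than the authors' stated choice.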
Related papers
- PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization [7.036380633387952]
We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing.
It can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks.
arXiv Detail & Related papers (2024-10-17T13:05:44Z) - UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition [55.153629718464565]
We introduce UniTabNet, a novel framework for table structure parsing based on the image-to-text model.
UniTabNet employs a "divide-and-conquer" strategy, utilizing an image-to-text model to decouple table cells and integrating both physical and logical decoders to reconstruct the complete table structure.
arXiv Detail & Related papers (2024-09-20T01:26:32Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - PixT3: Pixel-based Table-To-Text Generation [66.96636025277536]
We present PixT3, a multimodal table-to-text model that overcomes the challenges of linearization and input size limitations.
Experiments on the ToTTo and Logic2Text benchmarks show that PixT3 is competitive with, and in some settings superior to, generators that operate solely on text.
arXiv Detail & Related papers (2023-11-16T11:32:47Z) - Towards Table-to-Text Generation with Pretrained Language Model: A Table Structure Understanding and Text Deliberating Approach [60.03002572791552]
We propose a table structure understanding and text deliberating approach, namely TASD.
Specifically, we devise a three-layered multi-head attention network to realize the table-structure-aware text generation model.
Our approach can generate faithful and fluent descriptive texts for different types of tables.
arXiv Detail & Related papers (2023-01-05T14:03:26Z) - PTab: Using the Pre-trained Language Model for Modeling Tabular Data [5.791972449406902]
Recent studies show that neural-based models are effective in learning contextual representations for tabular data.
We propose a novel framework, PTab, that uses a pre-trained language model to model tabular data.
Our method achieves a better average AUC score in supervised settings compared to state-of-the-art baselines.
arXiv Detail & Related papers (2022-09-15T08:58:42Z) - SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning [5.5616364225463055]
We introduce a new framework, Subsetting features of Tabular data (SubTab).
We argue that reconstructing the data from a subset of its features, rather than from its corrupted version, in an autoencoder setting can better capture its underlying representation (an illustrative sketch appears after this list).
arXiv Detail & Related papers (2021-10-08T20:11:09Z) - TABBIE: Pretrained Representations of Tabular Data [22.444607481407633]
We devise a simple pretraining objective that learns exclusively from tabular data.
Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures.
A qualitative analysis of our model's learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.
arXiv Detail & Related papers (2021-05-06T11:15:16Z) - Learning Better Representation for Tables by Self-Supervised Tasks [23.69766883380125]
We propose two self-supervised tasks, Number Ordering and Significance Ordering, to help to learn better table representation.
We test our methods on the widely used ROTOWIRE dataset, which consists of NBA game statistics and related news.
arXiv Detail & Related papers (2020-10-15T09:03:38Z) - GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing [117.98107557103877]
We present GraPPa, an effective pre-training approach for table semantic parsing.
We construct synthetic question-SQL pairs over high-quality tables via a synchronous context-free grammar.
To maintain the model's ability to represent real-world data, we also include masked language modeling.
arXiv Detail & Related papers (2020-09-29T08:17:58Z) - TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
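As a companion to the SubTab entry above, the following is a hedged sketch of the subset-reconstruction idea: an autoencoder that sees only a subset of a row's features but is trained to reconstruct the full row. The feature count, subset layout, architecture, and training loop are illustrative assumptions rather than the original method's configuration.

```python
# Hedged sketch of the SubTab idea: encode only a subset of a row's features,
# then reconstruct the full row. Sizes and architecture are illustrative only.
import torch
import torch.nn as nn

n_features, subset_size, hidden_dim = 20, 5, 64

# Fixed, non-overlapping feature subsets (the original subsetting scheme may differ).
subsets = [list(range(i, i + subset_size)) for i in range(0, n_features, subset_size)]

encoder = nn.Sequential(nn.Linear(subset_size, hidden_dim), nn.ReLU())
decoder = nn.Linear(hidden_dim, n_features)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(256, n_features)  # toy batch of tabular rows
for step in range(100):           # a few illustrative training steps
    loss = torch.tensor(0.0)
    for idx in subsets:
        z = encoder(x[:, idx])    # encode one feature subset
        x_hat = decoder(z)        # reconstruct the *full* row from that subset alone
        loss = loss + nn.functional.mse_loss(x_hat, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```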
This list is automatically generated from the titles and abstracts of the papers in this site.