FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining
- URL: http://arxiv.org/abs/2109.07323v1
- Date: Wed, 15 Sep 2021 14:31:17 GMT
- Title: FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining
- Authors: Zhoujun Cheng, Haoyu Dong, Fan Cheng, Ran Jia, Pengfei Wu, Shi Han,
Dongmei Zhang
- Abstract summary: FORTAP is the first method for numerical-reasoning-aware table pretraining, leveraging a large corpus of spreadsheet formulae.
FORTAP achieves state-of-the-art results on two representative downstream tasks, cell type classification and formula prediction.
- Score: 23.747119682226675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tables store rich numerical data, but numerical reasoning over tables is
still a challenge. In this paper, we find that the spreadsheet formula, which
performs calculations on numerical values in tables, is naturally a strong
supervision of numerical reasoning. More importantly, large amounts of
spreadsheets with expert-made formulae are available on the web and can be
obtained easily. FORTAP is the first method for numerical-reasoning-aware table
pretraining that leverages a large corpus of spreadsheet formulae. We design two
formula pretraining tasks to explicitly guide FORTAP to learn numerical
reference and calculation in semi-structured tables. FORTAP achieves
state-of-the-art results on two representative downstream tasks, cell type
classification and formula prediction, showing great potential of
numerical-reasoning-aware pretraining.
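To make the supervision signal concrete, here is a minimal sketch (not the authors' code) of how a spreadsheet formula can be decomposed into the two kinds of targets the pretraining tasks address: which cells the formula references, and which calculations it performs. The regexes and label names are illustrative assumptions.
```python
# A minimal sketch of turning a spreadsheet formula into supervision: cells
# referenced by the formula become targets for a "numerical reference" task,
# and function/operator tokens become targets for a calculation task.
import re

def formula_to_supervision(formula: str):
    """Split a formula like '=SUM(B2:B5)/C1' into reference and calculation labels."""
    # Cell references (e.g. C1) and ranges (e.g. B2:B5) are reference targets.
    refs = re.findall(r"[A-Z]+[0-9]+(?::[A-Z]+[0-9]+)?", formula)
    # Function names and arithmetic operators are calculation targets.
    ops = re.findall(r"[A-Z]{2,}(?=\()|[+\-*/]", formula)
    return {"reference_targets": refs, "calculation_targets": ops}

print(formula_to_supervision("=SUM(B2:B5)/C1"))
# {'reference_targets': ['B2:B5', 'C1'], 'calculation_targets': ['SUM', '/']}
```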
Related papers
- FLEXTAF: Enhancing Table Reasoning with Flexible Tabular Formats [48.47559543509975]
We propose FLEXTAF-Single and FLEXTAF-Vote to enhance table reasoning performance by employing flexible formats.
Our experiments on WikiTableQuestions and TabFact reveal significant improvements, with average gains of 2.3% and 4.8%, respectively.
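For intuition, a hedged sketch of the FLEXTAF-Vote idea as described in the abstract: answer the same question under several table serializations and majority-vote the results. `render` and `llm_answer` are hypothetical stand-ins, not the paper's API.
```python
# A sketch of voting over flexible tabular formats; format names are illustrative.
from collections import Counter

FORMATS = ["markdown", "json", "pandas"]  # candidate table serializations

def flextaf_vote(table, question, render, llm_answer):
    # Ask the same question once per table format.
    answers = [llm_answer(render(table, fmt), question) for fmt in FORMATS]
    # Majority vote over the per-format answers.
    return Counter(answers).most_common(1)[0][0]
```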
arXiv Detail & Related papers (2024-08-16T17:00:11Z)
- Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations [36.2969566996675]
We develop an Auto-Formula system that can accurately predict formulas that users want to author in a target spreadsheet cell.
We use contrastive-learning techniques inspired by "similar-face recognition" from computer vision.
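For reference, this is the standard InfoNCE-style contrastive loss, the family of objective the abstract alludes to; the pairing strategy (which tables or cells count as positives) is the paper's contribution and is not reproduced here.
```python
# A standard InfoNCE contrastive loss with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07):
    """anchor, positive: (batch, dim) embeddings; other in-batch items act as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                   # (batch, batch) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)  # i-th anchor pairs with i-th positive
    return F.cross_entropy(logits, labels)
```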
arXiv Detail & Related papers (2024-04-19T03:28:18Z)
- Making Pre-trained Language Models Great on Tabular Prediction [50.70574370855663]
The transferability of deep neural networks (DNNs) has driven significant progress in image and language processing.
We present TP-BERTa, a specifically pre-trained LM for tabular data prediction.
A novel relative magnitude tokenization converts scalar numerical feature values to finely discrete, high-dimensional tokens, and an intra-feature attention approach integrates feature values with the corresponding feature names.
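A hedged sketch of what relative magnitude tokenization could look like: scalar feature values are bucketed into a small vocabulary of magnitude tokens. The quantile binning and token names below are illustrative assumptions, not TP-BERTa's exact scheme.
```python
# Map scalar feature values to discrete magnitude tokens by quantile bin.
import numpy as np

def magnitude_tokens(values, n_bins: int = 8):
    """Assign each scalar a token based on its quantile bin within the column."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)  # bin index in 0 .. n_bins-1
    return [f"[MAG_{b}]" for b in bins]

print(magnitude_tokens([3.2, 150.0, 0.1, 42.0]))
```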
arXiv Detail & Related papers (2024-03-04T08:38:56Z)
- NL2Formula: Generating Spreadsheet Formulas from Natural Language Queries [29.33149993368329]
This paper introduces a novel benchmark task called NL2Formula.
The aim is to generate executable formulas that are grounded on a spreadsheet table, given a Natural Language (NL) query as input.
We construct a comprehensive dataset consisting of 70,799 paired NL queries and corresponding spreadsheet formulas, covering 21,670 tables and 37 types of formula functions.
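An illustrative (invented) example of what an NL2Formula-style instance looks like: the input is a natural-language query grounded on a table, and the target is an executable spreadsheet formula. The field names and values are not taken from the dataset.
```python
# A hypothetical NL2Formula-style example pair; content is invented for illustration.
example = {
    "table_header": ["Region", "Q1 Sales", "Q2 Sales"],
    "nl_query": "What are the total Q1 sales across all regions?",
    "target_formula": "=SUM(B2:B10)",  # grounded in column B of the table
}
```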
arXiv Detail & Related papers (2024-02-20T05:58:05Z)
- Generative Table Pre-training Empowers Models for Tabular Prediction [71.76829961276032]
We propose TapTap, the first attempt to leverage table pre-training to empower models for tabular prediction.
TapTap can generate high-quality synthetic tables to support various applications, including privacy protection, low-resource regimes, missing-value imputation, and imbalanced classification.
It can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP), and Transformer.
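A hedged sketch of the usage pattern the abstract describes: synthetic rows from a table generator augment real training data before fitting a backbone model such as LightGBM. `generate_synthetic` is a hypothetical stand-in for TapTap's sampling API, with an assumed signature.
```python
# Augment a real tabular training set with generated rows, then fit a backbone.
import pandas as pd
from lightgbm import LGBMClassifier

def train_with_synthetic(real: pd.DataFrame, target: str, generate_synthetic):
    synthetic = generate_synthetic(real, n_rows=len(real))  # assumed generator API
    data = pd.concat([real, synthetic], ignore_index=True)  # real + synthetic rows
    model = LGBMClassifier()
    model.fit(data.drop(columns=[target]), data[target])
    return model
```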
arXiv Detail & Related papers (2023-05-16T06:37:38Z)
- FLAME: A small language model for spreadsheet formulas [25.667479554632735]
We present FLAME, a transformer-based model trained exclusively on Excel formulas.
We use sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction.
We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval.
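A minimal sketch of masked span corruption applied to a tokenized formula, the style of objective the abstract mentions; the naive regex tokenizer below is a placeholder, not FLAME's Excel-specific tokenizer.
```python
# Corrupt a contiguous span of formula tokens and keep it as the prediction target.
import re
import random

def mask_span(formula: str, span_len: int = 2, seed: int = 0):
    # Naive tokenization: cell refs / function names, numbers, and punctuation.
    tokens = re.findall(r"[A-Z]+[0-9]*|[0-9.]+|[(),:+\-*/=]", formula)
    rng = random.Random(seed)
    start = rng.randrange(max(1, len(tokens) - span_len))
    target = tokens[start:start + span_len]
    corrupted = tokens[:start] + ["<MASK>"] + tokens[start + span_len:]
    return " ".join(corrupted), " ".join(target)

print(mask_span("=SUM(A1:A10)*2"))
```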
arXiv Detail & Related papers (2023-01-31T17:29:43Z) - OmniTab: Pretraining with Natural and Synthetic Data for Few-shot
Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort.
We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
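A minimal sketch of what "omnivorous" data consumption can mean in practice: interleave naturally occurring examples with synthetic ones at some sampling ratio. The ratio and sampling scheme are illustrative assumptions, not the paper's recipe.
```python
# Interleave natural and synthetic pretraining examples at a fixed ratio.
import random

def mixed_batches(natural, synthetic, natural_ratio: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    while True:
        # Draw from the natural pool with probability natural_ratio.
        source = natural if rng.random() < natural_ratio else synthetic
        yield rng.choice(source)
```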
arXiv Detail & Related papers (2022-07-08T01:23:45Z) - Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?"
Based on an analysis of the table-based portion of the Natural Questions dataset (NQ-table), we find that table structure plays a negligible role in more than 70% of the cases.
We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases.
None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
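Of the three modules, the hard attention mask is the easiest to make concrete: a cell attends only to cells sharing its row or column. The construction below is a common formulation and an assumption about the paper's exact variant.
```python
# Build a boolean attention mask restricting attention to same-row/same-column cells.
import numpy as np

def row_col_attention_mask(rows: int, cols: int):
    """mask[i, j] is True iff flattened cells i and j share a row or a column."""
    n = rows * cols
    r = np.arange(n) // cols  # row index of each flattened cell
    c = np.arange(n) % cols   # column index of each flattened cell
    return (r[:, None] == r[None, :]) | (c[:, None] == c[None, :])

print(row_col_attention_mask(2, 3).astype(int))
```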
arXiv Detail & Related papers (2022-05-19T20:35:23Z) - SpreadsheetCoder: Formula Prediction from Semi-structured Context [70.41579328458116]
We propose a BERT-based model architecture to represent the tabular context in both row-based and column-based formats.
We train our model on a large dataset of spreadsheets, and demonstrate that SpreadsheetCoder achieves top-1 prediction accuracy of 42.51%.
Compared to the rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.
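A hedged sketch of the dual row-based and column-based context the abstract describes: collect the cells to the left of and above a target cell as two separate views. Window size and serialization are illustrative choices, not the paper's exact input format.
```python
# Gather row-wise and column-wise context around the cell where a formula goes.
def cell_context(grid, r, c, window: int = 2):
    row_ctx = [grid[r][j] for j in range(max(0, c - window), c)]  # cells left of target
    col_ctx = [grid[i][c] for i in range(max(0, r - window), r)]  # cells above target
    return {"row_context": row_ctx, "column_context": col_ctx}

grid = [["Item", "Price", "Qty"], ["Pen", "1.5", "3"], ["Total", "", ""]]
print(cell_context(grid, 2, 1))
# {'row_context': ['Total'], 'column_context': ['Price', '1.5']}
```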
arXiv Detail & Related papers (2021-06-26T11:26:27Z)
- TABBIE: Pretrained Representations of Tabular Data [22.444607481407633]
We devise a simple pretraining objective that learns exclusively from tabular data.
Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures.
A qualitative analysis of our model's learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.
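As a concrete instance of a simple tabular pretraining objective, here is a corrupted-cell detection sketch of the kind commonly associated with TABBIE: some cells are swapped with cells from other tables, and the model classifies each cell as original or corrupted. The corruption rate and sampling below are illustrative assumptions.
```python
# Corrupt a fraction of cells with foreign cells and emit per-cell binary labels.
import random

def corrupt_cells(table_cells, foreign_cells, rate: float = 0.15, seed: int = 0):
    rng = random.Random(seed)
    corrupted, labels = [], []
    for cell in table_cells:
        if rng.random() < rate:
            corrupted.append(rng.choice(foreign_cells))  # swapped-in foreign cell
            labels.append(1)                             # 1 = corrupted
        else:
            corrupted.append(cell)
            labels.append(0)                             # 0 = original
    return corrupted, labels
```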
arXiv Detail & Related papers (2021-05-06T11:15:16Z)