SpreadsheetCoder: Formula Prediction from Semi-structured Context
- URL: http://arxiv.org/abs/2106.15339v1
- Date: Sat, 26 Jun 2021 11:26:27 GMT
- Title: SpreadsheetCoder: Formula Prediction from Semi-structured Context
- Authors: Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, Denny Zhou
- Abstract summary: We propose a BERT-based model architecture to represent the tabular context in both row-based and column-based formats.
We train our model on a large dataset of spreadsheets, and demonstrate that SpreadsheetCoder achieves top-1 prediction accuracy of 42.51%.
Compared to the rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.
- Score: 70.41579328458116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spreadsheet formula prediction has been an important program synthesis
problem with many real-world applications. Previous works typically utilize
input-output examples as the specification for spreadsheet formula synthesis,
where each input-output pair simulates a separate row in the spreadsheet.
However, this formulation does not fully capture the rich context in real-world
spreadsheets. First, spreadsheet data entries are organized as tables, thus
rows and columns are not necessarily independent from each other. In addition,
many spreadsheet tables include headers, which provide high-level descriptions
of the cell data. However, previous synthesis approaches do not consider
headers as part of the specification. In this work, we present the first
approach for synthesizing spreadsheet formulas from tabular context, which
includes both headers and semi-structured tabular data. In particular, we
propose SpreadsheetCoder, a BERT-based model architecture to represent the
tabular context in both row-based and column-based formats. We train our model
on a large dataset of spreadsheets, and demonstrate that SpreadsheetCoder
achieves top-1 prediction accuracy of 42.51%, which is a considerable
improvement over baselines that do not employ rich tabular context. Compared to
the rule-based system, SpreadsheetCoder assists 82% more users in composing
formulas on Google Sheets.
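As an illustration of what "row-based and column-based formats" could mean in practice, the following is a minimal sketch that extracts the cells around a target cell and serializes them both ways. It is not the paper's implementation: the function name `tabular_context`, the `window` parameter, and the `[TARGET]` mask token are hypothetical choices made for this example only.

```python
from typing import List, Tuple

Grid = List[List[str]]

# Hypothetical placeholder for the cell whose formula is to be predicted.
MASK = "[TARGET]"


def tabular_context(
    grid: Grid, row: int, col: int, window: int = 2
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (row_based, column_based) views of the cells near (row, col).

    Assumption: the context is a square window clipped to the grid bounds.
    The target cell is masked, since its content is what a model would predict.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    r0, r1 = max(0, row - window), min(n_rows, row + window + 1)
    c0, c1 = max(0, col - window), min(n_cols, col + window + 1)

    def cell(r: int, c: int) -> str:
        return MASK if (r, c) == (row, col) else grid[r][c]

    # Row-based view: one sequence per spreadsheet row.
    row_based = [[cell(r, c) for c in range(c0, c1)] for r in range(r0, r1)]
    # Column-based view: one sequence per spreadsheet column.
    column_based = [[cell(r, c) for r in range(r0, r1)] for c in range(c0, c1)]
    return row_based, column_based


grid = [
    ["Item", "Price", "Qty", "Total"],  # header row
    ["Pen", "2", "10", ""],             # target cell is the empty Total
    ["Ink", "5", "3", "15"],
]
rows, cols = tabular_context(grid, row=1, col=3, window=1)
print(rows)  # each inner list is one row of the window
print(cols)  # each inner list is one column of the window
```

In this toy grid the header "Total" and the neighbouring value "15" both land in the context, which is the kind of signal that input-output examples alone would not carry.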
Related papers
- SpreadsheetLLM: Encoding Spreadsheets for Large Language Models [44.08092362611575]
SpreadsheetLLM is an efficient encoding method designed to unleash and optimize large language models (LLMs) on spreadsheets.
We develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs.
An LLM fine-tuned with SheetCompressor attains an average compression ratio of 25 times while achieving a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%.
arXiv Detail & Related papers (2024-07-12T06:34:21Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation [34.8332394229927]
SpreadsheetBench is designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users.
Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums.
Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance.
arXiv Detail & Related papers (2024-06-21T09:06:45Z) - Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations [36.2969566996675]
We develop an Auto-Formula system that can accurately predict formulas that users want to author in a target spreadsheet cell.
We use contrastive-learning techniques inspired by "similar-face recognition" from computer vision.
arXiv Detail & Related papers (2024-04-19T03:28:18Z) - NL2Formula: Generating Spreadsheet Formulas from Natural Language Queries [29.33149993368329]
This paper introduces a novel benchmark task called NL2Formula.
The aim is to generate executable formulas that are grounded on a spreadsheet table, given a Natural Language (NL) query as input.
We construct a comprehensive dataset consisting of 70,799 paired NL queries and corresponding spreadsheet formulas, covering 21,670 tables and 37 types of formula functions.
arXiv Detail & Related papers (2024-02-20T05:58:05Z) - OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort.
We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z) - Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?"
Based on an analysis on a table-based portion of the Natural Questions dataset (NQ-table), we find that structure plays a negligible role in more than 70% of the cases.
We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases.
None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
arXiv Detail & Related papers (2022-05-19T20:35:23Z) - TCN: Table Convolutional Network for Web Table Interpretation [52.32515851633981]
We propose a novel table representation learning approach considering both the intra- and inter-table contextual information.
Our method can outperform competitive baselines by +4.8% of F1 for column type prediction and by +4.1% of F1 for column pairwise relation prediction.
arXiv Detail & Related papers (2021-02-17T02:18:10Z) - A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z) - Identifying Table Structure in Documents using Conditional Generative Adversarial Networks [0.0]
In many industries and in academic research, information is primarily transmitted in the form of unstructured documents.
We propose a top-down approach, first using a conditional generative adversarial network to map a table image into a standardised 'skeleton' table form.
We then derive the latent table structure using xy-cut projection and Genetic Algorithm optimisation.
arXiv Detail & Related papers (2020-01-13T20:42:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.