FLAME: A small language model for spreadsheet formulas
- URL: http://arxiv.org/abs/2301.13779v2
- Date: Tue, 19 Dec 2023 22:56:39 GMT
- Title: FLAME: A small language model for spreadsheet formulas
- Authors: Harshit Joshi, Abishai Ebenezer, José Cambronero, Sumit Gulwani, Aditya Kanade, Vu Le, Ivan Radiček, Gust Verbruggen
- Abstract summary: We present FLAME, a transformer-based model trained exclusively on Excel formulas.
We use sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction.
We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval.
- Score: 25.667479554632735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spreadsheets are a vital tool for end-user data management. Using large
language models for formula authoring assistance in these environments can be
difficult, as these models are expensive to train and challenging to deploy due
to their size (up to billions of parameters). We present FLAME, a
transformer-based model trained exclusively on Excel formulas that leverages
domain insights to achieve competitive performance while being substantially
smaller (60M parameters) and training on two orders of magnitude less data. We
curate a training dataset using sketch deduplication, introduce an
Excel-specific formula tokenizer, and use domain-specific versions of masked
span prediction and noisy auto-encoding as pre-training objectives. We evaluate
FLAME on formula repair, formula completion, and similarity-based formula
retrieval. FLAME can outperform much larger models, such as the Davinci (175B)
and Cushman (12B) variants of Codex and CodeT5 (220M), in 10 of 14 evaluation
settings for the repair and completion tasks. For formula retrieval, FLAME
outperforms CodeT5, CodeBERT, and GraphCodeBERT.
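The curation and tokenization steps above are easiest to see on a concrete formula. Below is a minimal, hypothetical sketch of sketch deduplication, assuming the idea is to replace cell references, numbers, and string literals with placeholder tokens so that structurally identical formulas collapse to one representative; the regular expressions and placeholder names are illustrative, not the paper's implementation.

```python
import re

# Illustrative patterns for the pieces a formula sketch abstracts away.
CELL_REF = re.compile(r"\$?[A-Z]{1,3}\$?\d+(?::\$?[A-Z]{1,3}\$?\d+)?")
NUMBER = re.compile(r"(?<![A-Za-z0-9_])\d+(?:\.\d+)?")
STRING = re.compile(r'"[^"]*"')

def sketch(formula: str) -> str:
    """Map an Excel formula to its sketch (structure only)."""
    s = STRING.sub("<str>", formula)
    s = CELL_REF.sub("<ref>", s)
    return NUMBER.sub("<num>", s)

def dedupe_by_sketch(formulas):
    """Keep one representative formula per distinct sketch."""
    seen = {}
    for f in formulas:
        seen.setdefault(sketch(f), f)
    return list(seen.values())

corpus = [
    '=SUM(A1:A10)',
    '=SUM(B2:B20)',               # same sketch as the formula above
    '=IF(C1>5, "high", "low")',
]
print(dedupe_by_sketch(corpus))   # two representatives survive
```

The same placeholder view hints at why an Excel-specific tokenizer helps: cell references, sheet names, and function names follow regular patterns that a general-purpose code tokenizer would split unpredictably.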
Related papers
- 1.5-Pints Technical Report: Pretraining in Days, Not Months -- Your Language Model Thrives on Quality Data [0.0]
This paper presents a compute-efficient approach to pre-training a language model, "1.5-Pints", in only 9 days.
Based on MT-Bench (a benchmark that emulates human judgments), 1.5-Pints outperforms Apple's OpenELM and Microsoft's Phi.
This is achieved by a carefully curated pre-training dataset of 57 billion tokens, using a mix of automated and manual human review.
arXiv Detail & Related papers (2024-08-07T02:14:52Z)
- SpreadsheetLLM: Encoding Spreadsheets for Large Language Models [44.08092362611575]
SpreadsheetLLM is an efficient encoding method designed to unleash and optimize large language models (LLMs) on spreadsheets.
We develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs.
Fine-tuning an LLM with SheetCompressor yields an average compression ratio of 25x while achieving a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%.
arXiv Detail & Related papers (2024-07-12T06:34:21Z)
- Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations [36.2969566996675]
We develop an Auto-Formula system that can accurately predict formulas that users want to author in a target spreadsheet cell.
We use contrastive-learning techniques inspired by "similar-face recognition" from computer vision.
arXiv Detail & Related papers (2024-04-19T03:28:18Z)
- NL2Formula: Generating Spreadsheet Formulas from Natural Language Queries [29.33149993368329]
This paper introduces a novel benchmark task called NL2Formula.
The aim is to generate executable formulas that are grounded on a spreadsheet table, given a Natural Language (NL) query as input.
We construct a comprehensive dataset consisting of 70,799 paired NL queries and corresponding spreadsheet formulas, covering 21,670 tables and 37 types of formula functions.
arXiv Detail & Related papers (2024-02-20T05:58:05Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- Benchmarking Diverse-Modal Entity Linking with Generative Models [78.93737257356784]
We construct a benchmark for diverse-modal EL (DMEL) from existing EL datasets.
To approach the DMEL task, we proposed a generative diverse-modal model (GDMM) following a multimodal-encoder-decoder paradigm.
GDMM builds a stronger DMEL baseline, outperforming state-of-the-art task-specific EL models by 8.51 F1 score on average.
arXiv Detail & Related papers (2023-05-27T02:38:46Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- FORTAP: Using Formulae for Numerical-Reasoning-Aware Table Pretraining [23.747119682226675]
FORTAP is the first method for numerical-reasoning-aware table pretraining that leverages a large corpus of spreadsheet formulae.
FORTAP achieves strong results on two representative downstream tasks, cell type classification and formula prediction.
arXiv Detail & Related papers (2021-09-15T14:31:17Z)
- SpreadsheetCoder: Formula Prediction from Semi-structured Context [70.41579328458116]
We propose a BERT-based model architecture to represent the tabular context in both row-based and column-based formats.
We train our model on a large dataset of spreadsheets, and demonstrate that SpreadsheetCoder achieves top-1 prediction accuracy of 42.51%.
Compared to the rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.
arXiv Detail & Related papers (2021-06-26T11:26:27Z)
- Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most current training schemes, the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
arXiv Detail & Related papers (2020-06-12T14:49:47Z)
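The fusion step in the last entry can be illustrated with a minimal NumPy sketch: instead of averaging client parameters, the server averages the clients' predicted distributions on an unlabeled transfer set and distills the central model toward that ensemble target. The linear client models, shapes, and learning rate below are assumptions made for this example, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Assumed setup: K client models, each a linear classifier over d features.
d, num_classes, K = 8, 3, 4
client_weights = [rng.normal(size=(d, num_classes)) for _ in range(K)]

# Unlabeled transfer set available at the server.
X_unlabeled = rng.normal(size=(256, d))

# Ensemble target: average of the clients' predicted class distributions.
ensemble_probs = np.mean(
    [softmax(X_unlabeled @ W) for W in client_weights], axis=0)

# Distill into a central linear model by minimizing cross-entropy against
# the soft ensemble targets with plain gradient descent.
W_central = np.zeros((d, num_classes))
lr = 0.5
for _ in range(200):
    probs = softmax(X_unlabeled @ W_central)
    grad = X_unlabeled.T @ (probs - ensemble_probs) / len(X_unlabeled)
    W_central -= lr * grad

# The distilled model should now mimic the ensemble on the transfer set.
agreement = (softmax(X_unlabeled @ W_central).argmax(axis=1)
             == ensemble_probs.argmax(axis=1)).mean()
print(f"agreement with ensemble predictions: {agreement:.2f}")
```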