Leveraging Data Recasting to Enhance Tabular Reasoning
- URL: http://arxiv.org/abs/2211.12641v1
- Date: Wed, 23 Nov 2022 00:04:57 GMT
- Title: Leveraging Data Recasting to Enhance Tabular Reasoning
- Authors: Aashna Jena, Vivek Gupta, Manish Shrivastava, Julian Martin Eisenschlos
- Abstract summary: Prior work has mostly relied on two data generation strategies.
The first is human annotation, which yields linguistically diverse data but is difficult to scale.
The second is synthetic generation, which is scalable and cost-effective but lacks inventiveness.
- Score: 21.970920861791015
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Creating challenging tabular inference data is essential for learning complex
reasoning. Prior work has mostly relied on two data generation strategies. The
first is human annotation, which yields linguistically diverse data but is
difficult to scale. The second category for creation is synthetic generation,
which is scalable and cost effective but lacks inventiveness. In this research,
we present a framework for semi-automatically recasting existing tabular data
to make use of the benefits of both approaches. We utilize our framework to
build tabular NLI instances from five datasets that were initially intended for
tasks like table2text creation, tabular Q/A, and semantic parsing. We
demonstrate that recasted data could be used as evaluation benchmarks as well
as augmentation data to enhance performance on tabular NLI tasks. Furthermore,
we investigate the effectiveness of models trained on recasted data in the
zero-shot scenario, and analyse trends in performance across different recasted
dataset types.
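To make the idea of recasting concrete, here is a minimal sketch of how a single tabular Q/A pair might be turned into two tabular NLI instances. It is an illustration only, not the authors' framework: the `NLIInstance` class, the `recast_qa_to_nli` helper, the hypothesis template, and the example table are all hypothetical, and the abstract does not describe the actual recasting rules.

```python
from dataclasses import dataclass


@dataclass
class NLIInstance:
    premise: dict    # the table, simplified here to a column -> value mapping
    hypothesis: str  # natural-language statement about the table
    label: str       # "entailed" or "refuted"


def recast_qa_to_nli(table, column, template, distractor):
    """Hypothetical helper: build one entailed and one refuted hypothesis.

    The gold answer is read from the table; the distractor is a wrong
    value supplied by the caller (an assumption of this sketch).
    """
    answer = table[column]
    if distractor == answer:
        raise ValueError("distractor must differ from the gold answer")
    return [
        NLIInstance(table, template.format(answer=answer), "entailed"),
        NLIInstance(table, template.format(answer=distractor), "refuted"),
    ]


# Toy table and template, invented for illustration.
table = {"Name": "Mount Everest", "Height (m)": "8849", "Range": "Himalayas"}
instances = recast_qa_to_nli(
    table,
    column="Range",
    template="Mount Everest is part of the {answer}.",
    distractor="Andes",
)
for inst in instances:
    print(inst.label, "->", inst.hypothesis)
```

The sketch pairs every entailed hypothesis with a refuted one, so the resulting labels stay balanced; a real pipeline would also need to choose distractors and templates with care.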
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- TabDPT: Scaling Tabular Foundation Models [20.00390825519329]
We show how to harness the power of real data to improve performance and generalization.
Our model achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks.
TabDPT also demonstrates strong scaling as both model size and amount of available data increase.
arXiv Detail & Related papers (2024-10-23T18:00:00Z)
- Numerical Literals in Link Prediction: A Critical Examination of Models and Datasets [2.5999037208435705]
Link Prediction models that incorporate numerical literals have shown minor improvements on existing benchmark datasets.
It is unclear whether a model is actually better at using numerical literals, or more capable of utilizing the graph structure.
We propose a methodology to evaluate LP models that incorporate numerical literals.
arXiv Detail & Related papers (2024-07-25T17:55:33Z)
- TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks [30.922069185335246]
We find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature.
A considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines.
This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets.
arXiv Detail & Related papers (2024-06-27T17:55:31Z)
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [67.47600679176963]
RDBs store vast amounts of rich, informative data spread across interconnected tables.
The progress of predictive machine learning models falls behind advances in other domains such as computer vision or natural language processing.
We explore a class of baseline models predicated on converting multi-table datasets into graphs.
We assemble a diverse collection of (i) large-scale RDB datasets and (ii) coincident predictive tasks.
arXiv Detail & Related papers (2024-04-28T15:04:54Z)
- Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation [67.30538142519067]
We propose Tabular data Pre-Training via Meta-representation (TabPTM).
A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences.
Experiments validate that TabPTM achieves promising performance in new datasets, even under few-shot scenarios.
arXiv Detail & Related papers (2023-10-31T18:03:54Z)
- TabuLa: Harnessing Language Models for Tabular Data Synthesis [5.102332247789348]
We develop Tabula, a new type of data synthesizer based on the language model structure.
We show that Tabula reduces training time per epoch by 46.2% on average compared to the current LLM-based state-of-the-art algorithm.
We also propose a token sequence compression strategy to significantly reduce training time while preserving the quality of synthetic data.
arXiv Detail & Related papers (2023-10-19T13:50:56Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.