Leveraging Data Recasting to Enhance Tabular Reasoning
- URL: http://arxiv.org/abs/2211.12641v1
- Date: Wed, 23 Nov 2022 00:04:57 GMT
- Title: Leveraging Data Recasting to Enhance Tabular Reasoning
- Authors: Aashna Jena, Vivek Gupta, Manish Shrivastava, Julian Martin
Eisenschlos
- Abstract summary: Prior work has mostly relied on two data generation strategies.
The first is human annotation, which yields linguistically diverse data but is difficult to scale.
The second category for creation is synthetic generation, which is scalable and cost effective but lacks inventiveness.
- Score: 21.970920861791015
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Creating challenging tabular inference data is essential for learning complex
reasoning. Prior work has mostly relied on two data generation strategies. The
first is human annotation, which yields linguistically diverse data but is
difficult to scale. The second category for creation is synthetic generation,
which is scalable and cost effective but lacks inventiveness. In this research,
we present a framework for semi-automatically recasting existing tabular data
to make use of the benefits of both approaches. We utilize our framework to
build tabular NLI instances from five datasets that were initially intended for
tasks like table2text creation, tabular Q/A, and semantic parsing. We
demonstrate that recasted data could be used as evaluation benchmarks as well
as augmentation data to enhance performance on tabular NLI tasks. Furthermore,
we investigate the effectiveness of models trained on recasted data in the
zero-shot scenario, and analyse trends in performance across different recasted
datasets types.
Related papers
- Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.
We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.
At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z) - Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer [2.1677183904102257]
We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without pre-training on any real-world dataset.
APT is pre-trained with adversarial synthetic data agents, who deliberately challenge the model with different synthetic datasets.
We show that our framework matches state-of-the-art performance on small classification tasks without filtering on dataset characteristics.
arXiv Detail & Related papers (2025-02-06T23:58:11Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Numerical Literals in Link Prediction: A Critical Examination of Models and Datasets [2.5999037208435705]
Link Prediction models that incorporate numerical literals have shown minor improvements on existing benchmark datasets.
It is unclear whether a model is actually better in using numerical literals, or better capable of utilizing the graph structure.
We propose a methodology to evaluate LP models that incorporate numerical literals.
arXiv Detail & Related papers (2024-07-25T17:55:33Z) - TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks [30.922069185335246]
We find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature.
A considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines.
This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets.
arXiv Detail & Related papers (2024-06-27T17:55:31Z) - LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z) - 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [67.47600679176963]
RDBs store vast amounts of rich, informative data spread across interconnected tables.
The progress of predictive machine learning models falls behind advances in other domains such as computer vision or natural language processing.
We explore a class of baseline models predicated on converting multi-table datasets into graphs.
We assemble a diverse collection of large-scale RDB datasets and (ii) coincident predictive tasks.
arXiv Detail & Related papers (2024-04-28T15:04:54Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-weighted algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from schema, imposes strong typing, and schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.