Related papers: Making Table Understanding Work in Practice

Making Table Understanding Work in Practice

URL: http://arxiv.org/abs/2109.05173v1
Date: Sat, 11 Sep 2021 03:38:24 GMT
Title: Making Table Understanding Work in Practice
Authors: Madelon Hulsebos and Sneha Gathani and James Gale and Isil Dillig and Paul Groth and \c{C}a\u{g}atay Demiralp
Abstract summary: We discuss three challenges of deploying table understanding models and propose a framework to address them. We present SigmaTyper which encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model.
Score: 9.352813774921655
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding the semantics of tables at scale is crucial for tasks like data integration, preparation, and search. Table understanding methods aim at detecting a table's topic, semantic column types, column relations, or entities. With the rise of deep learning, powerful models have been developed for these tasks with excellent accuracy on benchmarks. However, we observe that there exists a gap between the performance of these models on these benchmarks and their applicability in practice. In this paper, we address the question: what do we need for these models to work in practice? We discuss three challenges of deploying table understanding models and propose a framework to address them. These challenges include 1) difficulty in customizing models to specific domains, 2) lack of training data for typical database tables often found in enterprises, and 3) lack of confidence in the inferences made by models. We present SigmaTyper which implements this framework for the semantic column type detection task. SigmaTyper encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model. Lastly, we highlight avenues for future research that further close the gap towards making table understanding effective in practice.

Related papers

TableGPT2: A Large Multimodal Model with Tabular Data Integration [22.77225649639725]
TableGPT2 is a model rigorously pre-trained and fine-tuned with over 593.8K tables and 2.36M high-quality query-table-outputs. One of TableGPT2's key innovations is its novel table encoder, specifically designed to capture schema-level and cell-level information.
arXiv Detail & Related papers (2024-11-04T13:03:13Z)
Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
Observatory: Characterizing Embeddings of Relational Tables [15.808819332614712]
Researchers and practitioners are keen to leverage language and table embedding models in many new application contexts. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. We propose Observatory, a formal framework to systematically analyze embedding representations of relational tables.
arXiv Detail & Related papers (2023-10-05T00:58:45Z)
Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks [2.690048852269647]
We study the advantages of a unified approach to table specific pretraining when scaled from 770M to 11B sequence to sequence models. Our work is the first attempt at studying the advantages of a unified approach to table specific pretraining when scaled from 770M to 11B sequence to sequence models.
arXiv Detail & Related papers (2023-10-01T21:06:15Z)
CHORUS: Foundation Models for Unified Data Discovery and Exploration [6.85448651843431]
We show that foundation models are highly applicable to the data discovery and data exploration domain. We show that a foundation-model-based approach outperforms the task-specific models and so the state of the art. This suggests a future direction in which disparate data management tasks can be unified under foundation models.
arXiv Detail & Related papers (2023-06-16T03:58:42Z)
Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data. Give access to a set of expert models and their predictions alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene. We show that representing objects as signed-distance fields not only enables to learn and represent a variety of models with higher accuracy compared to point-cloud and occupancy measure representations.
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance. We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z)
Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data. Based on experimental results, neural semantics that leverage GAP MODEL obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-generative benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
TURL: Table Understanding through Representation Learning [29.6016859927782]
TURL is a novel framework that introduces the pre-training/finetuning paradigm to relational Web tables. During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner. We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.
arXiv Detail & Related papers (2020-06-26T05:44:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.