Making Table Understanding Work in Practice
- URL: http://arxiv.org/abs/2109.05173v1
- Date: Sat, 11 Sep 2021 03:38:24 GMT
- Title: Making Table Understanding Work in Practice
- Authors: Madelon Hulsebos and Sneha Gathani and James Gale and Isil Dillig and
Paul Groth and Çağatay Demiralp
- Abstract summary: We discuss three challenges of deploying table understanding models and propose a framework to address them.
We present SigmaTyper which encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model.
- Score: 9.352813774921655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the semantics of tables at scale is crucial for tasks like data
integration, preparation, and search. Table understanding methods aim at
detecting a table's topic, semantic column types, column relations, or
entities. With the rise of deep learning, powerful models have been developed
for these tasks with excellent accuracy on benchmarks. However, we observe that
there exists a gap between the performance of these models on these benchmarks
and their applicability in practice. In this paper, we address the question:
what do we need for these models to work in practice?
We discuss three challenges of deploying table understanding models and
propose a framework to address them. These challenges include 1) difficulty in
customizing models to specific domains, 2) lack of training data for typical
database tables often found in enterprises, and 3) lack of confidence in the
inferences made by models. We present SigmaTyper which implements this
framework for the semantic column type detection task. SigmaTyper encapsulates
a hybrid model trained on GitTables and integrates a lightweight
human-in-the-loop approach to customize the model. Lastly, we highlight avenues
for future research that further close the gap towards making table
understanding effective in practice.
Related papers
- TableGPT2: A Large Multimodal Model with Tabular Data Integration [22.77225649639725]
TableGPT2 is a model rigorously pre-trained and fine-tuned with over 593.8K tables and 2.36M high-quality query-table-outputs.
One of TableGPT2's key innovations is its novel table encoder, specifically designed to capture schema-level and cell-level information.
arXiv Detail & Related papers (2024-11-04T13:03:13Z)
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
- Observatory: Characterizing Embeddings of Relational Tables [15.808819332614712]
Researchers and practitioners are keen to leverage language and table embedding models in many new application contexts.
There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage.
We propose Observatory, a formal framework to systematically analyze embedding representations of relational tables.
arXiv Detail & Related papers (2023-10-05T00:58:45Z)
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks [2.690048852269647]
This work is the first attempt at studying the advantages of a unified approach to table-specific pretraining when scaled from 770M to 11B sequence-to-sequence models.
arXiv Detail & Related papers (2023-10-01T21:06:15Z)
- CHORUS: Foundation Models for Unified Data Discovery and Exploration [6.85448651843431]
We show that foundation models are highly applicable to the data discovery and data exploration domain.
We show that a foundation-model-based approach outperforms the task-specific models and so the state of the art.
This suggests a future direction in which disparate data management tasks can be unified under foundation models.
arXiv Detail & Related papers (2023-06-16T03:58:42Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
- Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy measure representations.
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
- When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z)
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP MODEL obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
- TURL: Table Understanding through Representation Learning [29.6016859927782]
TURL is a novel framework that introduces the pre-training/finetuning paradigm to relational Web tables.
During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner.
We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.
arXiv Detail & Related papers (2020-06-26T05:44:54Z)