Observatory: Characterizing Embeddings of Relational Tables
- URL: http://arxiv.org/abs/2310.07736v3
- Date: Sat, 27 Jan 2024 17:54:48 GMT
- Title: Observatory: Characterizing Embeddings of Relational Tables
- Authors: Tianji Cong, Madelon Hulsebos, Zhenjie Sun, Paul Groth, H. V. Jagadish
- Abstract summary: Researchers and practitioners are keen to leverage language and table embedding models in many new application contexts.
There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage.
We propose Observatory, a formal framework to systematically analyze embedding representations of relational tables.
- Score: 15.808819332614712
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Language models and specialized table embedding models have recently
demonstrated strong performance on many tasks over tabular data. Researchers
and practitioners are keen to leverage these models in many new application
contexts; but limited understanding of the strengths and weaknesses of these
models, and the table representations they generate, makes the process of
finding a suitable model for a given task reliant on trial and error. There is
an urgent need to gain a comprehensive understanding of these models to
minimize inefficiency and failures in downstream usage.
To address this need, we propose Observatory, a formal framework to
systematically analyze embedding representations of relational tables.
Motivated both by invariants of the relational data model and by statistical
considerations regarding data distributions, we define eight primitive
properties, and corresponding measures to quantitatively characterize table
embeddings for these properties. Based on these properties, we define an
extensible framework to evaluate language and table embedding models. We
collect and synthesize a suite of datasets and use Observatory to analyze nine
such models. Our analysis provides insights into the strengths and weaknesses
of learned representations over tables. We find, for example, that some models
are sensitive to table structure such as column order, that functional
dependencies are rarely reflected in embeddings, and that specialized table
embedding models have relatively lower sample fidelity. Such insights help
researchers and practitioners better anticipate model behaviors and select
appropriate models for their downstream tasks, while guiding researchers in the
development of new models.
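
The column-order finding above suggests a simple probe one can run against any embedding model: embed a table, embed column-permuted copies, and compare the vectors. The sketch below illustrates such a measure in the spirit of Observatory's properties; it is a minimal sketch assuming a naive row-wise serialization and a generic sentence encoder (all-MiniLM-L6-v2) as a stand-in, and the helper names (serialize, embed_table, column_order_sensitivity) are illustrative rather than the paper's actual protocol or models.

```python
# Minimal column-order-sensitivity probe (illustrative, not Observatory's protocol).
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic stand-in text encoder


def serialize(table: pd.DataFrame) -> str:
    """Naively linearize a table as 'col: val' pairs, row by row."""
    rows = [
        "; ".join(f"{col}: {val}" for col, val in row.items())
        for _, row in table.iterrows()
    ]
    return " | ".join(rows)


def embed_table(table: pd.DataFrame) -> np.ndarray:
    return model.encode(serialize(table))


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def column_order_sensitivity(table: pd.DataFrame, n_perm: int = 10, seed: int = 0) -> float:
    """Average cosine similarity between the original table embedding and
    embeddings of column-permuted copies; 1.0 means fully order-insensitive."""
    rng = np.random.default_rng(seed)
    base = embed_table(table)
    sims = []
    for _ in range(n_perm):
        perm = rng.permutation(table.columns)
        sims.append(cosine(base, embed_table(table[perm])))
    return float(np.mean(sims))


table = pd.DataFrame({
    "city": ["Ann Arbor", "Amsterdam"],
    "country": ["USA", "Netherlands"],
    "population": [123851, 821752],
})
print(f"column-order similarity: {column_order_sensitivity(table):.3f}")
```

Under this naive serialization, a score near 1.0 would indicate column-order-insensitive embeddings, while lower scores indicate sensitivity to column order.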
Related papers
- Learning-based Models for Vulnerability Detection: An Extensive Study [3.1317409221921144]
We extensively and comprehensively investigate two types of state-of-the-art learning-based approaches.
We experimentally demonstrate the superiority of sequence-based models and the limited abilities of graph-based models.
arXiv Detail & Related papers (2024-08-14T13:01:30Z) - Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications [0.0]
Tabular Embedding Model (TEM) is a novel approach to fine-tune embedding models for tabular Retrieval-Augmented Generation (RAG) applications.
TEM not only outperforms current SOTA embedding models in this domain but also does so with a notably smaller and more efficient model structure.
arXiv Detail & Related papers (2024-04-28T14:58:55Z) - Wiki-TabNER: Advancing Table Interpretation Through Named Entity Recognition [19.423556742293762]
We analyse a widely used benchmark dataset for the evaluation of table interpretation (TI) tasks.
To overcome its limitations, we construct and annotate a new, more challenging dataset.
We propose a prompting framework for evaluating the newly developed large language models.
arXiv Detail & Related papers (2024-03-07T15:22:07Z) - Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning [56.50123642237106]
Common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment.
We argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios.
We propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-equivalent partial models".
arXiv Detail & Related papers (2023-01-24T16:40:01Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z) - Making Table Understanding Work in Practice [9.352813774921655]
We discuss three challenges of deploying table understanding models and propose a framework to address them.
We present SigmaTyper, which encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model.
arXiv Detail & Related papers (2021-09-11T03:38:24Z) - When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z) - CHEER: Rich Model Helps Poor Model via Knowledge Infusion [69.23072792708263]
We develop a knowledge infusion framework named CHEER that can succinctly summarize such a rich model into transferable representations.
Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.
arXiv Detail & Related papers (2020-05-21T21:44:21Z) - Explainable Matrix -- Visualization for Global and Local Interpretability of Random Forest Classification Ensembles [78.6363825307044]
We propose Explainable Matrix (ExMatrix), a novel visualization method for Random Forest (RF) interpretability.
It employs a simple yet powerful matrix-like visual metaphor, where rows are rules, columns are features, and cells are rule predicates (see the sketch after this list).
ExMatrix's applicability is confirmed via different examples, showing how it can be used in practice to promote the interpretability of RF models.
arXiv Detail & Related papers (2020-05-08T21:03:48Z)
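
As a rough illustration of the rule-by-feature layout described in the ExMatrix summary above, the sketch below extracts root-to-leaf rules from a small scikit-learn random forest and arranges their predicates in a rules x features table. It is an assumption-laden reconstruction of the described layout, not the ExMatrix implementation or its visual encoding.

```python
# Illustrative rules-by-features matrix for a random forest (not ExMatrix itself).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_iris()
forest = RandomForestClassifier(n_estimators=2, max_depth=2, random_state=0)
forest.fit(data.data, data.target)

rows = []
for est in forest.estimators_:
    tree = est.tree_

    def walk(node, preds):
        if tree.children_left[node] == -1:  # leaf: one complete root-to-leaf rule
            rows.append(dict(preds))
            return
        name = data.feature_names[tree.feature[node]]
        thr = tree.threshold[node]
        # Note: in this simplified sketch, a later split on the same feature
        # overwrites the earlier predicate instead of intersecting with it.
        walk(tree.children_left[node], {**preds, name: f"<= {thr:.2f}"})
        walk(tree.children_right[node], {**preds, name: f"> {thr:.2f}"})

    walk(0, {})

# Rows are rules, columns are features; empty cells mean the rule does not test that feature.
matrix = pd.DataFrame(rows).fillna("")
print(matrix)
```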
This list is automatically generated from the titles and abstracts of the papers on this site.