Model Joins: Enabling Analytics Over Joins of Absent Big Tables
- URL: http://arxiv.org/abs/2206.10434v1
- Date: Tue, 21 Jun 2022 14:28:24 GMT
- Title: Model Joins: Enabling Analytics Over Joins of Absent Big Tables
- Authors: Ali Mohammadi Shanghooshabad, Peter Triantafillou
- Abstract summary: This work puts forth a framework, Model Join, addressing these challenges.
The framework integrates and joins the per-table models of the absent tables.
The approximation stems from the models, not from the Model Join framework.
- Score: 9.797488793708624
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This work is motivated by two key facts. First, it is highly desirable to be
able to learn and perform knowledge discovery and analytics (LKD) tasks without
the need to access raw-data tables. This may be due to organizations finding it
increasingly frustrating and costly to manage and maintain ever-growing tables,
or for privacy reasons. Hence, compact models can be developed from the raw
data and used instead of the tables. Second, oftentimes, LKD tasks are to be
performed on a (potentially very large) table which is itself the result of
joining separate (potentially very large) relational tables. But how can one do
this, when the individual to-be-joined tables are absent? Here, we pose the
following fundamental questions: Q1: How can one "join models" of
(absent/deleted) tables or "join models with other tables" in a way that
enables LKD as if it were performed on the join of the actual raw tables? Q2:
What are appropriate models to use per table? Q3: As the model join would be an
approximation of the actual data join, how can one evaluate the quality of the
model join result? This work puts forth a framework, Model Join, addressing
these challenges. The framework integrates and joins the per-table models of
the absent tables and generates a uniform and independent sample that is a
high-quality approximation of a uniform and independent sample of the actual
raw-data join. The approximation stems from the models, not from the Model
Join framework. The sample obtained by the Model Join can be used to perform
LKD downstream tasks, such as approximate query processing, classification,
clustering, regression, association rule mining, visualization, and so on. To
our knowledge, this is the first work with this agenda and solutions. Detailed
experiments with TPC-DS data and synthetic data showcase Model Join's
usefulness.
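To make the sampling idea concrete, here is a minimal sketch (ours, not the paper's implementation) of how per-table generative models could be joined on a shared key. The model interface (`key_counts`, `sample_given_key`) is a hypothetical placeholder: any per-table model that estimates join-key multiplicities and conditional attribute distributions would fit.

```python
import random

def model_join_sample(model_a, model_b, n_samples):
    """Approximate a uniform, independent sample of A JOIN B on a key,
    using only per-table models of the absent tables.

    Assumed (hypothetical) model interface:
      model.key_counts()        -> {key: estimated row count for that key}
      model.sample_given_key(k) -> one synthetic row carrying join key k
    """
    counts_a, counts_b = model_a.key_counts(), model_b.key_counts()
    shared = [k for k in counts_a if k in counts_b]
    # Key k contributes n_a(k) * n_b(k) tuples to the join, so a uniform
    # sample over the join must draw keys with probability proportional
    # to that product, not uniformly.
    weights = [counts_a[k] * counts_b[k] for k in shared]
    keys = random.choices(shared, weights=weights, k=n_samples)
    return [{**model_a.sample_given_key(k), **model_b.sample_given_key(k)}
            for k in keys]
```

Under this scheme the only approximation error comes from the models' estimates of key multiplicities and conditional attribute distributions, consistent with the abstract's claim that the error stems from the models rather than from the join procedure itself.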
Related papers
- TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes [25.169832192255956]
We present TabSketchFM, a neural tabular model for data discovery over data lakes.
We finetune the pretrained model for identifying unionable, joinable, and subset table pairs.
Our results demonstrate significant improvements in F1 scores for search compared to state-of-the-art techniques.
arXiv Detail & Related papers (2024-06-28T17:28:53Z)
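As a generic illustration of the kind of column sketch such data-discovery models build on (our example, not TabSketchFM's actual featurization), a MinHash sketch lets the join overlap of two columns be estimated without comparing the raw values:

```python
import hashlib

def minhash_sketch(values, num_perm=64):
    """Compact column sketch: for each of num_perm salted hash functions,
    keep only the minimum hash value seen over the column."""
    return [min(int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
                for v in values)
            for seed in range(num_perm)]

def estimated_jaccard(sketch_1, sketch_2):
    """The fraction of matching minima estimates the Jaccard similarity
    of the two columns -- a standard signal for column joinability."""
    return sum(a == b for a, b in zip(sketch_1, sketch_2)) / len(sketch_1)
```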
- LaTable: Towards Large Tabular Models [63.995130144110156]
Tabular generative foundation models are hard to build due to the heterogeneous feature spaces of different datasets.
LaTable is a novel diffusion model that addresses these challenges and can be trained across different datasets.
We find that LaTable outperforms baselines on in-distribution generation, and that a finetuned LaTable generates out-of-distribution datasets better with fewer samples.
arXiv Detail & Related papers (2024-06-25T16:03:50Z)
- Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval [52.592071689901196]
We introduce a method that uncovers useful join relations for any query and database during table retrieval.
Our method outperforms the state-of-the-art approaches for table retrieval by up to 9.3% in F1 score and for end-to-end QA by up to 5.4% in accuracy.
arXiv Detail & Related papers (2024-04-15T15:55:01Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation learning approach for data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
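A minimal sketch of the graph construction this blueprint rests on (our simplification; the schema representation and helper names are illustrative): every row becomes a node and every foreign-key reference becomes an edge, after which a message passing GNN can propagate features across tables.

```python
def tables_to_graph(tables, foreign_keys):
    """Build a graph from a relational database.

    tables:       {table_name: {primary_key: row_dict}}
    foreign_keys: [(child_table, fk_column, parent_table)] schema links
    Returns (nodes, edges), where a node is identified by (table, pk).
    """
    nodes = {(tname, pk): row
             for tname, rows in tables.items()
             for pk, row in rows.items()}
    edges = [((child, pk), (parent, row[fk_col]))
             for child, fk_col, parent in foreign_keys
             for pk, row in tables[child].items()
             if row.get(fk_col) in tables[parent]]
    return nodes, edges
```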
- Observatory: Characterizing Embeddings of Relational Tables [15.808819332614712]
Researchers and practitioners are keen to leverage language and table embedding models in many new application contexts.
There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage.
We propose Observatory, a formal framework to systematically analyze embedding representations of relational tables.
arXiv Detail & Related papers (2023-10-05T00:58:45Z)
- Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks [2.690048852269647]
Our work is the first to study the advantages of a unified approach to table-specific pretraining when scaling sequence-to-sequence models from 770M to 11B parameters.
arXiv Detail & Related papers (2023-10-01T21:06:15Z)
- REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers [0.0]
We introduce REaLTabFormer (Realistic and Tabular Transformer), a synthetic data generation model.
It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence model.
Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model.
arXiv Detail & Related papers (2023-02-04T00:32:50Z)
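A hedged sketch of the two-stage flow described above. The model objects and their methods are placeholders for the autoregressive parent generator and the conditional child generator; this is not the released realtabformer API.

```python
def generate_relational(parent_model, child_model, n_parents, max_children):
    """Two-stage relational generation (placeholder interfaces):
    1. sample parent rows from an autoregressive (GPT-2-style) model;
    2. feed each parent row to a sequence-to-sequence model that emits
       the child rows conditioned on it, preserving the 1-to-many link.
    """
    dataset = []
    for _ in range(n_parents):
        parent_row = parent_model.sample_row()
        child_rows = child_model.generate(condition=parent_row,
                                          max_rows=max_children)
        dataset.append((parent_row, child_rows))
    return dataset
```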
- OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering [106.73213656603453]
We develop a simple table-based QA model with minimal annotation effort.
We propose an omnivorous pretraining approach that consumes both natural and synthetic data.
arXiv Detail & Related papers (2022-07-08T01:23:45Z)
- Table Retrieval May Not Necessitate Table-specific Model Design [83.27735758203089]
We focus on the task of table retrieval, and ask: "is table-specific model design necessary for table retrieval?"
Based on an analysis on a table-based portion of the Natural Questions dataset (NQ-table), we find that structure plays a negligible role in more than 70% of the cases.
We then experiment with three modules to explicitly encode table structures, namely auxiliary row/column embeddings, hard attention masks, and soft relation-based attention biases.
None of these yielded significant improvements, suggesting that table-specific model design may not be necessary for table retrieval.
arXiv Detail & Related papers (2022-05-19T20:35:23Z)
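To ground the second of the three modules tested above, here is one common way a hard structural attention mask can be built (our illustration; the paper's exact variant may differ): a cell token is allowed to attend only to tokens that share its row or column.

```python
import numpy as np

def same_row_col_mask(cell_coords):
    """Boolean attention mask over linearized table cells.

    cell_coords: list of (row, col) positions, one per token.
    mask[i, j] is True iff token i may attend to token j, i.e. the
    two cells share a row or a column.
    """
    coords = np.array(cell_coords)                       # shape (n, 2)
    same_row = coords[:, None, 0] == coords[None, :, 0]
    same_col = coords[:, None, 1] == coords[None, :, 1]
    return same_row | same_col
```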
- Making Table Understanding Work in Practice [9.352813774921655]
We discuss three challenges of deploying table understanding models and propose a framework to address them.
We present SigmaTyper, which encapsulates a hybrid model trained on GitTables and integrates a lightweight human-in-the-loop approach to customize the model.
arXiv Detail & Related papers (2021-09-11T03:38:24Z)
- Model Reuse with Reduced Kernel Mean Embedding Specification [70.044322798187]
We present a two-phase framework for finding helpful models for a current application.
In the upload phase, when a model is uploaded into the pool, we construct a reduced kernel mean embedding (RKME) as a specification for the model.
In the deployment phase, the relatedness of the current task to each pre-trained model is measured using the value of its RKME specification.
arXiv Detail & Related papers (2020-01-20T15:15:07Z)
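To make the relatedness measurement in the entry above concrete, here is a sketch of scoring a task sample against a weighted reduced-set specification (z_j, beta_j) via squared maximum mean discrepancy; the RBF kernel and gamma are illustrative choices, and a lower score indicates a better-matched pre-trained model.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel matrix between point sets x (n,d) and y (m,d)."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd_sq_to_spec(task_x, spec_z, spec_beta, gamma=1.0):
    """Squared MMD between the task's empirical kernel mean embedding
    and the model's reduced specification sum_j beta_j * k(z_j, .)."""
    k_xx = rbf_kernel(task_x, task_x, gamma).mean()
    k_xz = (rbf_kernel(task_x, spec_z, gamma) @ spec_beta).mean()
    k_zz = spec_beta @ rbf_kernel(spec_z, spec_z, gamma) @ spec_beta
    return k_xx - 2.0 * k_xz + k_zz
```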
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.