DeepJoin: Joinable Table Discovery with Pre-trained Language Models
- URL: http://arxiv.org/abs/2212.07588v2
- Date: Fri, 23 Jun 2023 14:58:03 GMT
- Title: DeepJoin: Joinable Table Discovery with Pre-trained Language Models
- Authors: Yuyang Dong, Chuan Xiao, Takuma Nozawa, Masafumi Enomoto, Masafumi
Oyamada
- Abstract summary: Existing approaches target equi-joins, the most common way of combining tables for creating a unified view.
Deepjoin is a deep learning model for accurate and efficient joinable table discovery.
Deepjoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts.
- Score: 10.639106014582756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the usefulness in data enrichment for data analysis tasks, joinable
table discovery has become an important operation in data lake management.
Existing approaches target equi-joins, the most common way of combining tables
for creating a unified view, or semantic joins, which tolerate misspellings and
different formats to deliver more join results. They are either exact solutions
whose running time is linear in the sizes of the query column and the target
table repository, or approximate solutions lacking precision. In this paper, we
propose Deepjoin, a deep learning model for accurate and efficient joinable
table discovery. Our solution is an embedding-based retrieval, which employs a
pre-trained language model (PLM) and is designed as one framework serving both
equi- and semantic joins. We propose a set of contextualization options to
transform column contents to a text sequence. The PLM reads the sequence and is
fine-tuned to embed columns to vectors such that columns are expected to be
joinable if they are close to each other in the vector space. Since the output
of the PLM is fixed in length, the subsequent search procedure becomes
independent of the column size. With a state-of-the-art approximate nearest
neighbor search algorithm, the search time is logarithmic in the repository
size. To train the model, we devise techniques for preparing training data as
well as for data augmentation. Experiments on real datasets demonstrate that
by training on a small subset of a corpus, Deepjoin generalizes to large
datasets, and its precision consistently outperforms that of other approximate
solutions. Deepjoin is even more accurate than an exact solution to semantic
joins when evaluated with labels from experts. Moreover, when equipped with a
GPU, Deepjoin is up to two orders of magnitude faster than existing solutions.
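
As a rough illustration of the pipeline described above (contextualize a column as a text sequence, embed it with a PLM into a fixed-length vector, then retrieve nearby columns with an approximate nearest neighbor index), the sketch below uses an off-the-shelf sentence encoder and hnswlib in place of DeepJoin's fine-tuned PLM and its actual contextualization options; the template, the model name "all-MiniLM-L6-v2", the toy corpus, and the index parameters are assumptions for illustration only.

```python
# Minimal sketch of embedding-based joinable-column search, in the spirit of
# DeepJoin. NOTE: the contextualization template, the off-the-shelf encoder,
# and the hnswlib settings below are illustrative assumptions, not the paper's
# exact choices (DeepJoin fine-tunes a PLM on column pairs from the corpus).
import numpy as np
import hnswlib                                          # pip install hnswlib
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers

def column_to_text(col_name, cells, max_cells=50):
    """Contextualize a column as text: column name followed by a sample of cells."""
    return f"{col_name}: " + ", ".join(cells[:max_cells])

# Toy repository of candidate columns (invented for this example).
corpus = {
    "countries.name":  ["France", "Germany", "Japan", "Brazil"],
    "cities.country":  ["france", "germany", "japan", "brazil"],
    "products.sku":    ["A-001", "A-002", "B-113"],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in for the fine-tuned PLM
texts = [column_to_text(name, cells) for name, cells in corpus.items()]
embeddings = encoder.encode(texts, normalize_embeddings=True)  # fixed-length vectors

# HNSW index: query time grows roughly logarithmically with the repository size.
index = hnswlib.Index(space="cosine", dim=int(embeddings.shape[1]))
index.init_index(max_elements=len(texts), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(texts)))
index.set_ef(50)

# Query column: columns whose vectors are close are predicted to be joinable.
query = column_to_text("orders.country", ["Japan", "Brazil", "France"])
q_emb = encoder.encode([query], normalize_embeddings=True)
labels, dists = index.knn_query(q_emb, k=2)
names = list(corpus)
print([names[i] for i in labels[0]])   # expected: the two country-valued columns
```

Because the encoder output is fixed in length, the cost of indexing and querying does not depend on the column size, which is what makes the search step independent of column contents once embeddings are computed.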
Related papers
- Leveraging Foundation Language Models (FLMs) for Automated Cohort Extraction from Large EHR Databases [50.552056536968166]
We propose and evaluate an algorithm for automating column matching on two large, popular and publicly-accessible EHR databases.
Our approach achieves a high top-three accuracy of 92%, correctly matching 12 out of the 13 columns of interest, when using a small, pre-trained general-purpose language model.
arXiv Detail & Related papers (2024-12-16T06:19:35Z)
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- TablePuppet: A Generic Framework for Relational Federated Learning [27.274856376963356]
Current federated learning (FL) approaches view decentralized training data as a single table, divided among participants either horizontally (by rows) or vertically (by columns).
This scenario requires intricate operations like joins and unions to obtain the training data, which is either costly or restricted by privacy concerns.
We propose TablePuppet, a generic framework for relational federated learning (RFL) that decomposes the learning process into two steps: (1) learning over join (LoJ) followed by (2) learning over union (LoU).
arXiv Detail & Related papers (2024-03-23T13:28:37Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- Relational Deep Learning: Graph Representation Learning on Relational Databases [69.7008152388055]
We introduce an end-to-end representation approach to learn on data laid out across multiple tables.
Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all data input.
arXiv Detail & Related papers (2023-12-07T18:51:41Z)
- Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection [2.3814052021083354]
This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem.
Our method efficiently marks each item in a database as neighbor or non-neighbor of a query point, based on a cosine distance threshold without exhaustive search.
We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy.
arXiv Detail & Related papers (2023-11-05T06:12:03Z)
- Diversity-Aware Meta Visual Prompting [111.75306320834629]
We present Diversity-Aware Meta Visual Prompting (DAM-VP), an efficient prompting method for transferring pre-trained models to downstream tasks with a frozen backbone.
We cluster the downstream dataset into small subsets in a diversity-adaptive way, with each subset having its own prompt.
All the prompts are optimized with a meta-prompt, which is learned across several datasets.
arXiv Detail & Related papers (2023-03-14T17:59:59Z)
- Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization [14.732408788010313]
ML applications increasingly rely on complex deep learning models and large datasets.
To scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model.
With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems.
We show that our approach significantly enhances the robustness of state-of-the-art Byzantine resilient aggregators.
arXiv Detail & Related papers (2023-02-12T06:38:30Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
- TableQnA: Answering List Intent Queries With Web Tables [12.941073798838167]
We focus on answering two classes of queries with HTML tables: those seeking lists of entities and those seeking superlative entities.
Existing approaches train machine learning models to select the answer from the candidates.
We develop novel features to compute structure-aware match and train a machine learning model.
arXiv Detail & Related papers (2020-01-10T01:43:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.