Deep Learning to Jointly Schema Match, Impute, and Transform Databases
- URL: http://arxiv.org/abs/2207.03536v1
- Date: Wed, 22 Jun 2022 21:25:59 GMT
- Title: Deep Learning to Jointly Schema Match, Impute, and Transform Databases
- Authors: Sandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack, Michael S.
Avidan, Yixin Chen, and Christopher R. King
- Abstract summary: Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing robust, generalizable algorithms.
We develop two novel procedures to address this problem.
In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets.
- Score: 19.200830026362425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An applied problem facing all areas of data science is harmonizing data
sources. Joining data from multiple origins with unmapped and only partially
overlapping features is a prerequisite to developing and testing robust,
generalizable algorithms, especially in health care. We approach this issue in
the common but difficult case of numeric features such as nearly Gaussian and
binary features, where unit changes and variable shift make simple matching of
univariate summaries unsuccessful. We develop two novel procedures to address
this problem. First, we demonstrate multiple methods of "fingerprinting" a
feature based on its associations to other features. In the setting of even
modest prior information, this allows most shared features to be accurately
identified. Second, we demonstrate a deep learning algorithm for translation
between databases. Unlike prior approaches, our algorithm takes advantage of
discovered mappings while identifying surrogates for unshared features and
learning transformations. In synthetic and real-world experiments using two
electronic health record databases, our algorithms outperform existing
baselines for matching variable sets, while jointly learning to impute unshared
or transformed variables.
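The "fingerprinting" idea can be illustrated with a toy sketch. Assuming a few anchor features whose mapping is already known (the "modest prior information" above), each remaining feature is summarized by its absolute correlations to those anchors, and features are matched across the two databases by optimal assignment. All names, data, and parameters here are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 1000
anchors = rng.normal(size=(n, 3))  # three features with a known mapping

# Database A: the anchors plus three unmatched numeric features.
unknown_A = anchors + 0.3 * rng.normal(size=(n, 3))
A = np.column_stack([anchors, unknown_A])

# Database B: same anchors; the unmatched features are shuffled and rescaled
# (simulating unknown column order and unit changes).
perm = [2, 0, 1]
B = np.column_stack([anchors, unknown_A[:, perm] * np.array([10.0, 0.01, 5.0])])

def anchor_fingerprint(X, n_anchors=3):
    """Fingerprint each non-anchor feature by |correlation| with the anchors.

    Absolute correlation is invariant to linear unit changes, which is exactly
    what defeats naive matching of univariate summaries.
    """
    C = np.corrcoef(X, rowvar=False)
    return np.abs(C[n_anchors:, :n_anchors])

FA, FB = anchor_fingerprint(A), anchor_fingerprint(B)

# Optimal one-to-one matching of fingerprints (Hungarian algorithm).
cost = np.linalg.norm(FA[:, None, :] - FB[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
mapping = dict(zip(rows.tolist(), cols.tolist()))
print(mapping)  # which unmatched column of B corresponds to each column of A
```

Because the fingerprint ignores each feature's own scale and location, the matching survives the per-column rescaling that would break a comparison of means or variances.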
Related papers
- Domain Adaptive Synapse Detection with Weak Point Annotations [63.97144211520869]
We present AdaSyn, a framework for domain adaptive synapse detection with weak point annotations.
In the WASPSYN challenge at ISBI 2023, our method ranked first place.
arXiv Detail & Related papers (2023-08-31T05:05:53Z) - DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (a method called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z) - Combining Varied Learners for Binary Classification using Stacked
Generalization [3.1871776847712523]
This paper performs binary classification using Stacked Generalization on a high-dimensional Polycystic Ovary Syndrome dataset.
The paper reports several evaluation metrics and points out a subtle inconsistency found with the Receiver Operating Characteristic curve, which was shown to be incorrect.
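Stacked generalization can be sketched briefly: base learners are trained on the data, and a meta-learner combines their out-of-fold predictions. The snippet below uses synthetic data as a stand-in for the PCOS dataset; the learner choices are illustrative, not the paper's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Varied base learners; a logistic-regression meta-learner combines their
# cross-validated predictions (the "stacking" step).
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"stacked AUC: {auc:.3f}")
```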
arXiv Detail & Related papers (2022-02-17T21:47:52Z) - Simple Stochastic and Online Gradient DescentAlgorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
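A minimal sketch of OGD for pairwise learning: the loss is defined on pairs of examples, here a pairwise logistic loss that pushes positives to score above negatives, as in AUC maximization. The data, step sizes, and loss choice are illustrative, not the paper's algorithm:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
X = rng.normal(size=(2000, d))
y = (X @ w_true > 0).astype(int)
pos, neg = X[y == 1], X[y == 0]

w = np.zeros(d)
for t in range(5000):
    # Stream one (positive, negative) pair per step.
    xp = pos[rng.integers(len(pos))]
    xn = neg[rng.integers(len(neg))]
    diff = xp - xn
    # Gradient of the pairwise logistic loss log(1 + exp(-w @ diff)).
    grad = -diff * expit(-(w @ diff))
    w -= 0.1 / np.sqrt(t + 1) * grad  # decaying step size

# Empirical AUC: fraction of (positive, negative) pairs ranked correctly.
auc = ((pos @ w)[:, None] > (neg @ w)[None, :]).mean()
print(f"pairwise OGD AUC: {auc:.3f}")
```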
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - Semantic Parsing Natural Language into Relational Algebra [4.56877715768796]
Natural language interface to database (NLIDB) has been researched extensively over the past decades.
Recent progress in neural deep learning seems to provide a promising direction towards building a general NLIDB system.
arXiv Detail & Related papers (2021-06-25T19:36:02Z) - Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions, so that they learn in their task specific domains while staying close to each other.
This facilitates cross-fertilization, in which data collected across different domains help improve the learning performance on each task.
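The parameter coupling described above can be sketched as two task-specific linear models fit jointly with a penalty that keeps their weight vectors close. All names and constants below are illustrative, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 200
w_shared = rng.normal(size=d)
w1_true = w_shared + 0.1 * rng.normal(size=d)  # tasks are similar but not identical
w2_true = w_shared + 0.1 * rng.normal(size=d)
X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y1 = X1 @ w1_true + 0.1 * rng.normal(size=n)
y2 = X2 @ w2_true + 0.1 * rng.normal(size=n)

lam = 1.0  # coupling strength: lam/2 * ||w1 - w2||^2 added to the joint loss
w1, w2 = np.zeros(d), np.zeros(d)
lr = 0.01
for _ in range(2000):
    # Each task's gradient mixes its own squared error with the coupling term.
    g1 = X1.T @ (X1 @ w1 - y1) / n + lam * (w1 - w2)
    g2 = X2.T @ (X2 @ w2 - y2) / n + lam * (w2 - w1)
    w1, w2 = w1 - lr * g1, w2 - lr * g2

print(np.round(w1 - w2, 2))  # coupled weights end up close to each other
```

The coupling shrinks each task's weights toward the other's, so data from one domain effectively regularizes the other.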
arXiv Detail & Related papers (2020-10-24T21:35:57Z) - Propositionalization and Embeddings: Two Sides of the Same Coin [0.0]
This paper outlines some of the modern data processing techniques used in relational learning.
It focuses on the propositionalization and embedding data transformation approaches.
We present two efficient implementations of the unifying methodology.
arXiv Detail & Related papers (2020-06-08T08:33:21Z) - Bayesian Meta-Prior Learning Using Empirical Bayes [3.666114237131823]
We propose a hierarchical Empirical Bayes approach that addresses the absence of informative priors, and the inability to control parameter learning rates.
Our method learns empirical meta-priors from the data itself and uses them to decouple the learning rates of first-order and second-order features.
Our findings are promising, as optimizing over sparse data is often a challenge.
arXiv Detail & Related papers (2020-02-04T05:08:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.