Deep Learning to Jointly Schema Match, Impute, and Transform Databases
- URL: http://arxiv.org/abs/2207.03536v1
- Date: Wed, 22 Jun 2022 21:25:59 GMT
- Title: Deep Learning to Jointly Schema Match, Impute, and Transform Databases
- Authors: Sandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack, Michael S.
Avidan, Yixin Chen, and Christopher R. King
- Abstract summary: Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing robust, generalizable algorithms.
We develop two novel procedures to address this problem.
In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets.
- Score: 19.200830026362425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An applied problem facing all areas of data science is harmonizing data
sources. Joining data from multiple origins with unmapped and only partially
overlapping features is a prerequisite to developing and testing robust,
generalizable algorithms, especially in health care. We approach this issue in
the common but difficult case of numeric features such as nearly Gaussian and
binary features, where unit changes and variable shift make simple matching of
univariate summaries unsuccessful. We develop two novel procedures to address
this problem. First, we demonstrate multiple methods of "fingerprinting" a
feature based on its associations to other features. In the setting of even
modest prior information, this allows most shared features to be accurately
identified. Second, we demonstrate a deep learning algorithm for translation
between databases. Unlike prior approaches, our algorithm takes advantage of
discovered mappings while identifying surrogates for unshared features and
learning transformations. In synthetic and real-world experiments using two
electronic health record databases, our algorithms outperform existing
baselines for matching variable sets, while jointly learning to impute unshared
or transformed variables.
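The "fingerprinting" idea can be illustrated with a toy sketch. Assuming a few anchor features whose mapping is already known (the "modest prior information" above), each remaining feature is summarized by its absolute correlations to those anchors, and features are matched across the two databases by optimal assignment. All names, data, and parameters here are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 1000
anchors = rng.normal(size=(n, 3))  # three features with a known mapping

# Database A: the anchors plus three unmatched numeric features.
unknown_A = anchors + 0.3 * rng.normal(size=(n, 3))
A = np.column_stack([anchors, unknown_A])

# Database B: same anchors; the unmatched features are shuffled and rescaled
# (simulating unknown column order and unit changes).
perm = [2, 0, 1]
B = np.column_stack([anchors, unknown_A[:, perm] * np.array([10.0, 0.01, 5.0])])

def anchor_fingerprint(X, n_anchors=3):
    """Fingerprint each non-anchor feature by |correlation| with the anchors.

    Absolute correlation is invariant to linear unit changes, which is exactly
    what defeats naive matching of univariate summaries.
    """
    C = np.corrcoef(X, rowvar=False)
    return np.abs(C[n_anchors:, :n_anchors])

FA, FB = anchor_fingerprint(A), anchor_fingerprint(B)

# Optimal one-to-one matching of fingerprints (Hungarian algorithm).
cost = np.linalg.norm(FA[:, None, :] - FB[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
mapping = dict(zip(rows.tolist(), cols.tolist()))
print(mapping)  # which unmatched column of B corresponds to each column of A
```

Because the fingerprint ignores each feature's own scale and location, the matching survives the per-column rescaling that would break a comparison of means or variances.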
Related papers
- Domain Adaptive Synapse Detection with Weak Point Annotations [63.97144211520869]
We present AdaSyn, a framework for domain adaptive synapse detection with weak point annotations.
In the WASPSYN challenge at ISBI 2023, our method ranked first place.
arXiv Detail & Related papers (2023-08-31T05:05:53Z) - DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (a method called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z) - Combining Varied Learners for Binary Classification using Stacked
Generalization [3.1871776847712523]
This paper performs binary classification using Stacked Generalization on a high-dimensional Polycystic Ovary Syndrome dataset.
The paper reports several evaluation metrics and points out a subtle inconsistency found with the Receiver Operating Characteristic curve, which was shown to be incorrect.
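Stacked generalization can be sketched briefly: base learners are trained on the data, and a meta-learner combines their out-of-fold predictions. The snippet below uses synthetic data as a stand-in for the PCOS dataset; the learner choices are illustrative, not the paper's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Varied base learners; a logistic-regression meta-learner combines their
# cross-validated predictions (the "stacking" step).
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"stacked AUC: {auc:.3f}")
```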
arXiv Detail & Related papers (2022-02-17T21:47:52Z) - Simple Stochastic and Online Gradient DescentAlgorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
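A minimal sketch of OGD for pairwise learning: the loss is defined on pairs of examples, here a pairwise logistic loss that pushes positives to score above negatives, as in AUC maximization. The data, step sizes, and loss choice are illustrative, not the paper's algorithm:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
X = rng.normal(size=(2000, d))
y = (X @ w_true > 0).astype(int)
pos, neg = X[y == 1], X[y == 0]

w = np.zeros(d)
for t in range(5000):
    # Stream one (positive, negative) pair per step.
    xp = pos[rng.integers(len(pos))]
    xn = neg[rng.integers(len(neg))]
    diff = xp - xn
    # Gradient of the pairwise logistic loss log(1 + exp(-w @ diff)).
    grad = -diff * expit(-(w @ diff))
    w -= 0.1 / np.sqrt(t + 1) * grad  # decaying step size

# Empirical AUC: fraction of (positive, negative) pairs ranked correctly.
auc = ((pos @ w)[:, None] > (neg @ w)[None, :]).mean()
print(f"pairwise OGD AUC: {auc:.3f}")
```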
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - Semantic Parsing Natural Language into Relational Algebra [4.56877715768796]
Natural language interface to database (NLIDB) has been researched extensively over the past decades.
Recent progress in neural deep learning seems to provide a promising direction towards building a general NLIDB system.
arXiv Detail & Related papers (2021-06-25T19:36:02Z) - Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions, so that they learn in their task specific domains while staying close to each other.
This facilitates cross-fertilization, in which data collected across different domains help improve the learning performance on each task.
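The parameter coupling described above can be sketched as two task-specific linear models fit jointly with a penalty that keeps their weight vectors close. All names and constants below are illustrative, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 200
w_shared = rng.normal(size=d)
w1_true = w_shared + 0.1 * rng.normal(size=d)  # tasks are similar but not identical
w2_true = w_shared + 0.1 * rng.normal(size=d)
X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y1 = X1 @ w1_true + 0.1 * rng.normal(size=n)
y2 = X2 @ w2_true + 0.1 * rng.normal(size=n)

lam = 1.0  # coupling strength: lam/2 * ||w1 - w2||^2 added to the joint loss
w1, w2 = np.zeros(d), np.zeros(d)
lr = 0.01
for _ in range(2000):
    # Each task's gradient mixes its own squared error with the coupling term.
    g1 = X1.T @ (X1 @ w1 - y1) / n + lam * (w1 - w2)
    g2 = X2.T @ (X2 @ w2 - y2) / n + lam * (w2 - w1)
    w1, w2 = w1 - lr * g1, w2 - lr * g2

print(np.round(w1 - w2, 2))  # coupled weights end up close to each other
```

The coupling shrinks each task's weights toward the other's, so data from one domain effectively regularizes the other.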
arXiv Detail & Related papers (2020-10-24T21:35:57Z) - Propositionalization and Embeddings: Two Sides of the Same Coin [0.0]
This paper outlines some of the modern data processing techniques used in relational learning.
It focuses on the propositionalization and embedding data transformation approaches.
We present two efficient implementations of the unifying methodology.
arXiv Detail & Related papers (2020-06-08T08:33:21Z) - Bayesian Meta-Prior Learning Using Empirical Bayes [3.666114237131823]
We propose a hierarchical Empirical Bayes approach that addresses the absence of informative priors, and the inability to control parameter learning rates.
Our method learns empirical meta-priors from the data itself and uses them to decouple the learning rates of first-order and second-order features.
Our findings are promising, as optimizing over sparse data is often a challenge.
arXiv Detail & Related papers (2020-02-04T05:08:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.