Survive the Schema Changes: Integration of Unmanaged Data Using Deep
Learning
- URL: http://arxiv.org/abs/2010.07586v1
- Date: Thu, 15 Oct 2020 08:10:37 GMT
- Title: Survive the Schema Changes: Integration of Unmanaged Data Using Deep
Learning
- Authors: Zijie Wang, Lixi Zhou, Amitabh Das, Valay Dave, Zhanpeng Jin, Jia Zou
- Abstract summary: We propose to use deep learning to automatically deal with schema changes through a super cell representation and automatic injection of perturbations to the training data.
Our experimental results demonstrate that our proposed approach is effective for two real-world data integration scenarios.
- Score: 2.6464841907587004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data is the king in the age of AI. However, data integration is often a
laborious task that is hard to automate. Schema change is one significant
obstacle to the automation of the end-to-end data integration process. Although
there exist mechanisms such as query discovery and schema modification language
to handle the problem, these approaches can only work with the assumption that
the schema is maintained by a database. However, we observe diversified schema
changes in heterogeneous data and open data, most of which has no schema
defined. In this work, we propose to use deep learning to automatically deal
with schema changes through a super cell representation and automatic injection
of perturbations to the training data to make the model robust to schema
changes. Our experimental results demonstrate that our proposed approach is
effective for two real-world data integration scenarios: coronavirus data
integration, and machine log integration.
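The idea of injecting schema perturbations into training data can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `perturb_schema`, the `synonyms` map, and the `drop_prob` parameter are hypothetical, and the sketch assumes the perturbations of interest are attribute renaming, attribute removal, and attribute reordering. It also assumes synonyms never collide with another attribute's name.

```python
import random


def perturb_schema(rows, synonyms, rng, drop_prob=0.2):
    """Return a copy of `rows` with a randomly perturbed schema.

    rows: list of dicts that all share the same keys (one dict per record).
    synonyms: hypothetical map from an attribute name to alternative names.
    rng: a random.Random instance, passed in so results are reproducible.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    # Randomly drop attributes, but always keep at least one.
    kept = [c for c in columns if rng.random() >= drop_prob] or [columns[0]]
    # Randomly reorder the surviving attributes.
    rng.shuffle(kept)
    # Randomly rename each attribute to one of its synonyms, or keep its name.
    renamed = {c: rng.choice([c] + synonyms.get(c, [])) for c in kept}
    return [{renamed[c]: row[c] for c in kept} for row in rows]


# Usage: generate several perturbed variants of one small table.
rng = random.Random(0)
rows = [{"date": "2020-03-01", "state": "AZ", "confirmed": 5}]
synonyms = {"state": ["province_state"], "confirmed": ["cases"]}
augmented = [perturb_schema(rows, synonyms, rng) for _ in range(3)]
```

Each variant keeps the cell values of the original records while changing attribute names, order, or presence, so a model trained on the variants sees the same data under different schemas.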
Related papers
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
- ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.
We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
- Compound Schema Registry [0.0]
We propose the use of generalized schema evolution (GSE) facilitated by a compound AI system.
This system employs Large Language Models (LLMs) to interpret the semantics of schema changes.
Our approach includes developing a task-specific language, Schema Transformation Language (STL), to generate schema mappings as an intermediate representation.
arXiv Detail & Related papers (2024-06-17T05:50:46Z)
- Automatic Recommendations for Evolving Relational Databases Schema [0.7412445894287709]
We present a meta-model that computes the impact of planned changes on the database schema.
We show that without detailed knowledge of the database, we could perform the same change in 75% less time than the expert database architect.
arXiv Detail & Related papers (2024-04-12T15:14:38Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- ReMatch: Retrieval Enhanced Schema Matching with LLMs [0.874967598360817]
We present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs).
Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher.
arXiv Detail & Related papers (2024-03-03T17:14:40Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied to IT system operation and maintenance.
One direction aims at the recognition of re-occurring anomaly types to enable remediation automation.
We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z)
- Automated Metadata Harmonization Using Entity Resolution & Contextual Embedding [0.0]
We demonstrate automation of this step with the help of Cognitive Database's Db2Vec embedding approach.
Apart from matching schemas, we demonstrate that it can also infer the correct ontological structure of the target data model.
arXiv Detail & Related papers (2020-10-17T02:14:15Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)