DCoM: A Deep Column Mapper for Semantic Data Type Detection
- URL: http://arxiv.org/abs/2106.12871v1
- Date: Thu, 24 Jun 2021 10:12:35 GMT
- Title: DCoM: A Deep Column Mapper for Semantic Data Type Detection
- Authors: Subhadip Maji, Swapna Sourav Rout and Sudeep Choudhary
- Abstract summary: We introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types.
We train DCoM on 686,765 data columns extracted from the VizNet corpus with 78 different semantic data types.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting semantic data types is a crucial task in data science for
automated data cleaning, schema matching, data discovery, semantic data type
normalization and sensitive data identification. Existing methods include
regular expression-based or dictionary lookup-based methods that are not robust
to dirty as well as unseen data and can predict only a small number of semantic
data types. Existing machine learning methods extract a large number
of engineered features from the data and build a logistic regression, random forest
or feedforward neural network model for this purpose. In this paper, we introduce
DCoM, a collection of multi-input NLP-based deep neural networks to detect
semantic data types where, instead of extracting a large number of features from
the data, we feed the raw values of columns (or instances) to the model as
text. We train DCoM on 686,765 data columns extracted from the VizNet corpus with
78 different semantic data types. DCoM outperforms other contemporary results
by a significant margin on the same dataset.
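To make the idea of feeding raw column values to a model as text more concrete, here is a minimal sketch of how a column might be serialized into a single string. It is an illustration under stated assumptions, not the authors' preprocessing: the helper name `column_to_text`, the `[SEP]` separator, the sampling limit of 20 values and the fixed seed are all hypothetical choices that do not come from the paper.

```python
import random
from typing import Sequence


def column_to_text(
    values: Sequence[object],
    max_values: int = 20,      # assumed cap on cells per column, not from the paper
    sep: str = " [SEP] ",      # assumed separator token, not from the paper
    seed: int = 0,
) -> str:
    """Serialize the raw values of one column into one text string (hypothetical helper)."""
    # Drop missing or empty cells and keep the raw string form of everything else.
    cells = [str(v).strip() for v in values if v is not None and str(v).strip()]
    # Sample a fixed number of cells so that long columns fit a bounded input length.
    if len(cells) > max_values:
        cells = random.Random(seed).sample(cells, max_values)
    return sep.join(cells)


example = column_to_text(["Paris", " Berlin ", None, "", "Tokyo"])
print(example)  # Paris [SEP] Berlin [SEP] Tokyo
```

A string produced this way could then be tokenized and passed to a text classifier, in contrast to approaches that first compute a large set of hand-engineered statistics per column.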
Related papers
- Approaching Metaheuristic Deep Learning Combos for Automated Data Mining [0.5419570023862531]
This work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining.
Experiments on the MNIST dataset for handwritten digit recognition were performed.
It was empirically observed that using a ground truth labeled dataset's validation accuracy is inadequate for correcting labels of other previously unseen data instances.
arXiv Detail & Related papers (2024-10-16T10:28:22Z)
- DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control [68.14798033899955]
Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content.
However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation?
We investigate this question in the context of autonomous driving, and answer it with a resounding "yes".
arXiv Detail & Related papers (2023-12-05T18:34:12Z)
- DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection [41.436817746749384]
Diffusion Model is a scalable data engine for object detection.
DiffusionEngine (DE) provides high-quality detection-oriented training pairs in a single stage.
arXiv Detail & Related papers (2023-09-07T17:55:01Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach used in low dimensions for LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
- Explaining Classifiers Trained on Raw Hierarchical Multiple-Instance Data [0.0]
A number of data sources have the natural form of structured data interchange formats (e.g. security logs in JSON/XML format).
Existing methods, such as Hierarchical Multiple-Instance Learning (HMIL), allow learning from such data in its raw form.
By treating these models as subset selection problems, we demonstrate how interpretable explanations, with favourable properties, can be generated using computationally efficient algorithms.
We compare to an explanation technique adopted from graph neural networks showing an order of magnitude speed-up and higher-quality explanations.
arXiv Detail & Related papers (2022-08-04T14:48:37Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- Missing Value Imputation on Multidimensional Time Series [16.709162372224355]
We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets.
DeepMVI combines fine-grained and coarse-grained patterns along a time series, and trends from related series across categorical dimensions.
Experiments show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases.
arXiv Detail & Related papers (2021-03-02T09:55:05Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data to the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z)