SimClone: Detecting Tabular Data Clones using Value Similarity
- URL: http://arxiv.org/abs/2407.12802v1
- Date: Mon, 24 Jun 2024 04:16:32 GMT
- Title: SimClone: Detecting Tabular Data Clones using Value Similarity
- Authors: Xu Yang, Gopi Krishnan Rajbahadur, Dayi Lin, Shaowei Wang, Zhen Ming Jiang
- Abstract summary: The presence of data clones between datasets can cause issues when those datasets are used to build AI software.
We propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information.
Our results show that our SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC.
- Score: 37.85935189975307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulties in managing data assets and data license violations when datasets with clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information to detect data clones (e.g., font size, column headers). However, tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets that does not rely on structural information. SimClone instead utilizes value similarities for data clone detection. We also propose a visualization approach as a part of SimClone to help locate the exact position of the cloned data within a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone's visualization component helps identify the exact location of the data clone in a dataset with a Precision@10 value of 0.80 in the top 20 true positive predictions.
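The abstract describes detecting clones through value overlap rather than structural cues. As a loose illustration of that general idea (this is a hypothetical sketch, not SimClone's actual algorithm; `jaccard` and `likely_clone` are invented helper names), one could compare the value sets of every column pair across two header-less tables:

```python
# Hypothetical sketch of value-similarity clone detection between two
# tabular datasets; NOT the SimClone implementation. Columns are plain
# lists of values, so no headers, fonts, or other structural
# information are used.

def jaccard(a, b):
    """Jaccard similarity between two sets of values."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def column_similarity_matrix(table_a, table_b):
    """Score every column pair (i, j) by the overlap of their values."""
    sets_a = [set(col) for col in table_a]
    sets_b = [set(col) for col in table_b]
    return [[jaccard(sa, sb) for sb in sets_b] for sa in sets_a]

def likely_clone(table_a, table_b, threshold=0.9):
    """Flag a dataset pair as a likely clone if any column pair's
    value overlap exceeds the threshold."""
    matrix = column_similarity_matrix(table_a, table_b)
    return any(score >= threshold for row in matrix for score in row)

# Two tables sharing one column verbatim (a partial clone).
a = [[1, 2, 3, 4], ["x", "y", "z", "w"]]
b = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(likely_clone(a, b))  # the shared first column triggers a match
```

The per-pair similarity matrix also hints at how a visualization component could highlight which columns of a pair contain the cloned values.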
Related papers
- How the Misuse of a Dataset Harmed Semantic Clone Detection [0.9361474110798144]
This paper demonstrates that BigCloneBench is problematic to use as ground truth for learning or evaluating semantic code similarity. In a literature review of 179 papers that use BigCloneBench as a dataset, we found 139 papers that used BigCloneBench to evaluate semantic clone detection. We emphasise that using BigCloneBench remains valid for the intended purpose of evaluating syntactic or textual clone detection of Type-1, Type-2, and Type-3 clones.
arXiv Detail & Related papers (2025-05-07T10:52:28Z) - On the Use of Deep Learning Models for Semantic Clone Detection [4.796947520072581]
We propose a multi-step evaluation approach for five state-of-the-art clone detection models leveraging existing benchmark datasets.
Specifically, we examine three highly-performing single-language models (ASTNN, GMN, CodeBERT) on BigCloneBench, SemanticCloneBench, and GPTCloneBench.
While single-language models show high F1 scores on BigCloneBench, their performance on SemanticCloneBench varies by up to 20%.
Interestingly, the cross-language model (C4) shows superior performance (around 7%) on SemanticCloneBench over other models.
arXiv Detail & Related papers (2024-12-19T11:15:02Z) - Masked adversarial neural network for cell type deconvolution in spatial transcriptomics [5.1141169336435945]
We propose a Masked Adversarial Neural Network (MACD) to align real ST data with simulated ST data generated from scRNA-seq data.
We demonstrate its accuracy in performing cell type deconvolution on 32 simulated datasets and 2 real datasets.
arXiv Detail & Related papers (2024-08-09T13:46:28Z) - Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z) - You Only Condense Once: Two Rules for Pruning Condensed Datasets [41.92794134275854]
You Only Condense Once (YOCO) produces smaller condensed datasets with two embarrassingly simple dataset pruning rules.
Experiments validate our findings on networks including ConvNet, ResNet and DenseNet.
arXiv Detail & Related papers (2023-10-21T14:05:58Z) - Replication: Contrastive Learning and Data Augmentation in Traffic Classification Using a Flowpic Input Representation [47.95762911696397]
We reproduce [16] on the same datasets and replicate its most salient aspect (the importance of data augmentation) on three additional public datasets.
While we confirm most of the original results, we also found a 20% accuracy drop on some of the investigated scenarios due to a data shift in the original dataset.
arXiv Detail & Related papers (2023-09-18T12:55:09Z) - Diffusion Dataset Generation: Towards Closing the Sim2Real Gap for Pedestrian Detection [0.11470070927586014]
We propose a novel method of synthetic data creation meant to close the sim2real gap for the pedestrian detection task.
Our method uses a diffusion-based architecture to learn a real-world distribution which, once trained, is used to generate datasets.
We show that training on a combination of generated and simulated data increases average precision by as much as 27.3% for pedestrian detection models in real-world data.
arXiv Detail & Related papers (2023-05-16T12:33:51Z) - Uniform-in-Phase-Space Data Selection with Iterative Normalizing Flows [0.0]
A strategy is proposed to select data points such that they uniformly span the phase-space of the data.
An iterative method is used to accurately estimate the probability of the rare data points when only a small subset of the dataset is used to construct the probability map.
The proposed framework is demonstrated as a viable pathway to enable data-efficient machine learning when abundant data is available.
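As a rough illustration of the uniform-in-phase-space idea (a hypothetical sketch, not the paper's normalizing-flow method), accepting each point with probability inversely proportional to its estimated density thins dense regions while always retaining rare points:

```python
import random

def uniform_select(points, density_of, seed=0):
    """Keep each point with probability inversely proportional to its
    estimated phase-space density (hypothetical illustration). Dense
    regions are thinned; the rarest points are always kept."""
    rng = random.Random(seed)
    densities = [density_of(p) for p in points]
    d_min = min(densities)
    # A point at the minimum density has acceptance probability 1;
    # a point 10x denser survives only 10% of the time.
    return [p for p, d in zip(points, densities) if rng.random() < d_min / d]

# Toy 1-D example: the estimated density is high near 0, low elsewhere.
points = list(range(-5, 6))
density = lambda x: 10.0 if abs(x) <= 1 else 1.0
selected = uniform_select(points, density)
```

In the paper's setting the density estimate itself comes from an iteratively refined probability map; here `density_of` is simply assumed to be given.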
arXiv Detail & Related papers (2021-12-28T20:06:28Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
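An objective of this kind typically takes the following generic form (the notation here is illustrative, not necessarily the paper's own):

```latex
\min_{W} \; \underbrace{\lVert X - X W W^{\top} \rVert_F^2}_{\text{reconstruction error}}
  + \lambda \lVert W \rVert_{2,p},
\qquad
\lVert W \rVert_{2,p} = \Big( \sum_{i} \lVert w_i \rVert_2^{\,p} \Big)^{1/p},
```

where $X$ is the data matrix, $W$ the projection matrix with rows $w_i$, and $\lambda$ trades off reconstruction fidelity against row-wise sparsity of $W$ (rows driven to zero correspond to discarded features).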
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z) - Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z) - MSC: A Dataset for Macro-Management in StarCraft II [52.52008929278214]
We release a new macro-management dataset based on the platform SC2LE.
MSC consists of well-designed feature vectors, pre-defined high-level actions and final result of each match.
Besides the dataset, we propose a baseline model and present initial baseline results for global state evaluation and build order prediction.
arXiv Detail & Related papers (2017-10-09T14:59:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.