LSH methods for data deduplication in a Wikipedia artificial dataset
- URL: http://arxiv.org/abs/2112.11478v1
- Date: Fri, 10 Dec 2021 20:01:26 GMT
- Title: LSH methods for data deduplication in a Wikipedia artificial dataset
- Authors: Juan Ciro, Daniel Galvez, Tim Schlippe, David Kanter
- Abstract summary: Area-Under-Curve (AUC) values over 0.9 were observed for most models, with the best model reaching 0.96.
Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data.
- Score: 0.43592370626384086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper illustrates locality sensitive hashing (LSH) models for the
identification and removal of nearly redundant data in a text dataset. To
evaluate the different models, we create an artificial dataset for data
deduplication using English Wikipedia articles. Area-Under-Curve (AUC) values over 0.9
were observed for most models, with the best model reaching 0.96. Deduplication
enables more effective model training by preventing the model from learning a
distribution that differs from the real one as a result of the repeated data.
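As a rough, generic illustration of the kind of LSH model the paper benchmarks (not the authors' implementation), the sketch below builds MinHash signatures over word shingles and uses banded LSH to flag candidate near-duplicate groups. All parameters (NUM_PERM, BANDS, the shingle size) and helper names are illustrative assumptions.

```python
import hashlib
import random
from collections import defaultdict

NUM_PERM = 128    # MinHash signature length
BANDS = 32        # 32 bands x 4 rows: catches pairs with Jaccard similarity roughly >= (1/32)**(1/4) ~ 0.42
SHINGLE_SIZE = 2  # word shingles

random.seed(0)
PRIME = (1 << 61) - 1  # large prime for the universal hash family h(x) = (a*x + b) mod p
PERMS = [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(NUM_PERM)]

def shingles(text):
    """Set of overlapping word shingles; assumes non-empty text."""
    words = text.lower().split()
    return {" ".join(words[i:i + SHINGLE_SIZE]) for i in range(max(1, len(words) - SHINGLE_SIZE + 1))}

def minhash(text):
    """MinHash signature approximating the document's shingle set."""
    hs = [int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big") for s in shingles(text)]
    return [min((a * h + b) % PRIME for h in hs) for a, b in PERMS]

def lsh_candidate_groups(docs):
    """Group documents whose signatures collide in at least one band."""
    buckets = defaultdict(list)
    rows = NUM_PERM // BANDS
    for doc_id, text in docs.items():
        sig = minhash(text)
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different wikipedia article about hashing",
}
print(lsh_candidate_groups(docs))  # "a" and "b" will very likely share a bucket; "c" will not
```

More rows per band make the filter stricter (fewer false positives); more bands make it more permissive. A verification pass over the candidate pairs then yields the final duplicate decisions that an AUC-style evaluation can score.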
Related papers
- Extracting Training Data from Unconditional Diffusion Models [76.85077961718875]
Diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI).
We aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization.
Based on the theoretical analysis, we propose a novel data extraction method called Surrogate condItional Data Extraction (SIDE) that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models.
arXiv Detail & Related papers (2024-06-18T16:20:12Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - On the Limitation of Diffusion Models for Synthesizing Training Datasets [5.384630221560811]
This paper investigates the gap between synthetic and real samples by analyzing the synthetic samples reconstructed from real samples through the diffusion and reverse process.
We found that the synthetic datasets degrade classification performance over real datasets even when using state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-11-22T01:42:23Z) - MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z) - Representation Transfer Learning via Multiple Pre-trained models for Linear Regression [3.5788754401889014]
We consider the problem of learning a linear regression model on a data domain of interest (target) given few samples.
To aid learning, we are provided with a set of pre-trained regression models that are trained on potentially different data domains.
We propose a representation transfer based learning method for constructing the target model.
arXiv Detail & Related papers (2023-05-25T19:35:24Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space (a naive parameter-averaging baseline is sketched after this list).
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Diffusion models for missing value imputation in tabular data [10.599563005836066]
Missing value imputation in machine learning is the task of estimating the missing values in the dataset accurately using available information.
We propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (CSDI_T).
To effectively handle categorical and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization (a sketch of the first two appears after this list).
arXiv Detail & Related papers (2022-10-31T08:13:26Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Data from Model: Extracting Data from Non-robust and Robust Models [83.60161052867534]
This work explores the reverse process of generating data from a model, attempting to reveal the relationship between the data and the model.
We repeat the process of Data to Model (DtM) and Data from Model (DfM) in sequence and explore the loss of feature mapping information.
Our results show that the accuracy drop is limited even after multiple sequences of DtM and DfM, especially for robust models.
arXiv Detail & Related papers (2020-07-13T05:27:48Z)
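For the "Dataless Knowledge Fusion" entry above, the snippet below shows plain parameter averaging as a naive baseline for merging models in parameter space. The paper's actual merging method is more principled; `average_state_dicts` is a hypothetical helper, not its API.

```python
import torch

def average_state_dicts(state_dicts):
    """Uniformly average parameters of models sharing one architecture.

    A naive parameter-space merging baseline; assumes all state dicts
    have identical keys and tensor shapes.
    """
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }
```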
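For the CSDI_T entry, the sketch below contrasts two of the three categorical encodings it mentions: one-hot and analog bits (binary digits rescaled to {-1, +1}). The exact encodings used in CSDI_T may differ; this is a generic illustration.

```python
import math

def one_hot(index: int, num_classes: int) -> list[float]:
    """K-dimensional indicator vector for class `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def analog_bits(index: int, num_classes: int) -> list[float]:
    """Binary representation of the class index, rescaled to {-1, +1};
    needs only ceil(log2(K)) dimensions instead of K."""
    n_bits = max(1, math.ceil(math.log2(num_classes)))
    return [2.0 * ((index >> b) & 1) - 1.0 for b in reversed(range(n_bits))]

print(one_hot(5, 8))      # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(analog_bits(5, 8))  # [1.0, -1.0, 1.0]  (5 = binary 101)
```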
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.