Finding Meaningful Distributions of ML Black-boxes under Forensic
Investigation
- URL: http://arxiv.org/abs/2305.05869v1
- Date: Wed, 10 May 2023 03:25:23 GMT
- Title: Finding Meaningful Distributions of ML Black-boxes under Forensic
Investigation
- Authors: Jiyi Zhang, Han Fang, Hwee Kuan Lee, Ee-Chien Chang
- Abstract summary: Given a poorly documented neural network model, we take the perspective of a forensic investigator who wants to find out the model's data domain.
We propose solving this problem by leveraging a comprehensive corpus such as ImageNet to select a meaningful distribution.
Our goal is to select a set of samples from the corpus for the given model.
- Score: 25.79728190384834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a poorly documented neural network model, we take the perspective of a
forensic investigator who wants to find out the model's data domain (e.g.
whether on face images or traffic signs). Although existing methods such as
membership inference and model inversion can be used to uncover some
information about an unknown model, they still require knowledge of the data
domain to start with. In this paper, we propose solving this problem by
leveraging a comprehensive corpus such as ImageNet to select a meaningful
distribution that is close to the original training distribution and leads to
high performance in follow-up investigations. The corpus comprises two
components, a large dataset of samples and meta information such as
hierarchical structure and textual information on the samples. Our goal is to
select a set of samples from the corpus for the given model. The core of our
method is an objective function that considers two criteria on the selected
samples: the model functional properties (derived from the dataset), and
semantics (derived from the metadata). We also give an algorithm to efficiently
search the large space of all possible subsets w.r.t. the objective function.
Experimental results show that the proposed method is effective. For
example, cloning a given model (originally trained with CIFAR-10) by using
Caltech 101 can achieve 45.5% accuracy. By using datasets selected by our
method, the accuracy is improved to 72.0%.
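The abstract describes an objective that combines a functional criterion (how the model behaves on the samples) with a semantic criterion (from the corpus metadata), plus a search over subsets. The abstract does not give the actual objective or search algorithm, so the following is only a minimal sketch under stated assumptions: `scores` stands in for a per-category functional score (for example, the black-box's mean confidence on that category), `paths` stands in for the corpus hierarchy, `tree_distance` is a hypothetical semantic distance, and a plain greedy search replaces the paper's algorithm.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]  # hypothetical: root-to-leaf path of a category in the corpus hierarchy


def tree_distance(a: Path, b: Path) -> int:
    """Number of edges between two categories in the hierarchy tree."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return len(a) + len(b) - 2 * common


def greedy_select(
    scores: Dict[str, float],
    paths: Dict[str, Path],
    k: int,
    lam: float,
) -> List[str]:
    """Greedily pick k categories.

    Assumed objective (not taken from the paper):
        J(S) = sum of scores[c] for c in S
               - lam * sum of tree_distance over all pairs in S
    Each step adds the category with the largest marginal gain.
    """
    selected: List[str] = []
    remaining = list(scores)
    while remaining and len(selected) < k:

        def gain(c: str) -> float:
            # Semantic penalty: distance to every category already chosen.
            penalty = sum(tree_distance(paths[c], paths[s]) for s in selected)
            return scores[c] - lam * penalty

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data, invented purely for illustration.
paths = {
    "cat": ("animal", "mammal", "cat"),
    "dog": ("animal", "mammal", "dog"),
    "truck": ("vehicle", "truck"),
    "car": ("vehicle", "car"),
    "daisy": ("plant", "flower", "daisy"),
}
scores = {"cat": 0.90, "dog": 0.85, "truck": 0.88, "car": 0.30, "daisy": 0.20}

print(greedy_select(scores, paths, k=2, lam=0.1))
print(greedy_select(scores, paths, k=2, lam=0.0))
```

On this toy data, the semantic term (lam=0.1) keeps the two selected categories close in the hierarchy (cat and dog), whereas the functional score alone (lam=0.0) selects cat and truck.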
Related papers
- Data Pruning in Generative Diffusion Models [2.0111637969968]
Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets.
We show that eliminating redundant or noisy data in large datasets is beneficial particularly when done strategically.
arXiv Detail & Related papers (2024-11-19T14:13:25Z)
- Target-Aware Language Modeling via Granular Data Sampling [25.957424920194914]
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources.
A cost-effective and straightforward approach is sampling with low-dimensional data features.
We show that pretrained models perform on par with the full RefinedWeb data and outperform randomly selected samples for model sizes ranging from 125M to 1.5B.
arXiv Detail & Related papers (2024-09-23T04:52:17Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Post-training Model Quantization Using GANs for Synthetic Data Generation [57.40733249681334]
We investigate the use of synthetic data as a substitute for the calibration with real data for the quantization method.
We compare the performance of models quantized using data generated by StyleGAN2-ADA and our pre-trained DiStyleGAN, with quantization using real data and an alternative data generation method based on fractal images.
arXiv Detail & Related papers (2023-05-10T11:10:09Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach used in low dimensions to LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
- Example-Based Explainable AI and its Application for Remote Sensing Image Classification [0.0]
We show an example of an instance in a training dataset that is similar to the input data to be inferred.
Using a remote sensing image dataset from the Sentinel-2 satellite, the concept was successfully demonstrated.
arXiv Detail & Related papers (2023-02-03T03:48:43Z)
- Spectral goodness-of-fit tests for complete and partial network data [1.7188280334580197]
We use recent results in random matrix theory to derive a general goodness-of-fit test for dyadic data.
We show that our method, when applied to a specific model of interest, provides a straightforward, computationally fast way of selecting parameters.
Our method leads to improved community detection algorithms.
arXiv Detail & Related papers (2021-06-17T17:56:30Z)
- Self-Supervision based Task-Specific Image Collection Summarization [3.115375810642661]
We propose a novel approach to task-specific image corpus summarization using semantic information and self-supervision.
Our method uses a classification-based Wasserstein generative adversarial network (WGAN) as a feature generating network.
The model then generates a summary at inference time by using K-means clustering in the semantic embedding space.
arXiv Detail & Related papers (2020-12-19T10:58:04Z)
- Set Based Stochastic Subsampling [85.5331107565578]
We propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network.
We show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification.
arXiv Detail & Related papers (2020-06-25T07:36:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.