Data Portraits: Recording Foundation Model Training Data
- URL: http://arxiv.org/abs/2303.03919v2
- Date: Thu, 14 Dec 2023 16:55:42 GMT
- Title: Data Portraits: Recording Foundation Model Training Data
- Authors: Marc Marone, Benjamin Van Durme
- Abstract summary: Data Portraits are artifacts that record training data and allow for downstream inspection.
We document a popular language modeling corpus and a recently released code modeling dataset.
Our tool is lightweight and fast, costing only 3% of the dataset size in overhead.
- Score: 47.03896259762976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models are trained on increasingly immense and opaque datasets.
Even while these models are now key in AI system building, it can be difficult
to answer the straightforward question: has the model already encountered a
given example during training? We therefore propose a widespread adoption of
Data Portraits: artifacts that record training data and allow for downstream
inspection. First we outline the properties of such an artifact and discuss how
existing solutions can be used to increase transparency. We then propose and
implement a solution based on data sketching, stressing fast and space
efficient querying. Using our tools, we document a popular language modeling
corpus (The Pile) and a recently released code modeling dataset (The Stack). We
show that our solution enables answering questions about test set leakage and
model plagiarism. Our tool is lightweight and fast, costing only 3% of the
dataset size in overhead. We release a live interface of our tools at
https://dataportraits.org/ and call on dataset and model creators to release
Data Portraits as a complement to current documentation practices.
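The abstract describes a membership-testing artifact built on data sketching. As a minimal illustration of that idea (not the paper's actual implementation), the following sketch indexes fixed-width character n-grams of a corpus in a Bloom filter, so "was this text in training data?" can be answered approximately in space far smaller than the corpus. All parameters (bit-array size, hash count, chunk width) are hypothetical.

```python
# Minimal Data Portrait-style membership sketch: a Bloom filter over
# non-overlapping fixed-width character chunks. Hypothetical parameters;
# the released tool at dataportraits.org may differ in every detail.
import hashlib


class NgramSketch:
    def __init__(self, num_bits=1 << 20, num_hashes=4, width=16):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.width = width  # character n-gram width

    def _positions(self, chunk):
        # Derive k bit positions from one SHA-256 digest of the chunk.
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            val = int.from_bytes(digest[i * 8:(i + 1) * 8], "big")
            yield val % self.num_bits

    def _chunks(self, text):
        # Non-overlapping fixed-width chunks keep the index small.
        return [text[i:i + self.width]
                for i in range(0, len(text) - self.width + 1, self.width)]

    def add(self, text):
        for chunk in self._chunks(text):
            for pos in self._positions(chunk):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, text):
        # True if every chunk may be present; Bloom filters admit
        # false positives but never false negatives.
        return all(
            all((self.bits[p // 8] >> (p % 8)) & 1
                for p in self._positions(chunk))
            for chunk in self._chunks(text)
        )
```

A usage sketch: `s = NgramSketch(); s.add(training_doc); s.contains(suspect_text)`. Note that because chunking here is alignment-sensitive, a substring query starting mid-chunk would miss; a production system would need overlapping or strided chunks, which is one reason the real artifact's overhead (reported at 3% of dataset size) depends on such design choices.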
Related papers
- How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data [26.836532205017104]
We find that many datasets suffer from severe data leakage.
This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data.
We present XCoder, a family of models finetuned from LLaMA3.
arXiv Detail & Related papers (2024-09-05T17:46:30Z)
- Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is five times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z)
- VQA Training Sets are Self-play Environments for Generating Few-shot Pools [2.556825820539693]
We propose a technique in which existing training sets can be directly used for constructing computational environments with task metrics as rewards.
The proposed method starts with zero-shot prompts and iteratively refines them by selecting few-shot examples that maximize the task metric on the training set.
Our experiments showcase how Gemini learns how to use itself, or another smaller and specialized model such as ScreenAI, to iteratively improve performance on training sets.
arXiv Detail & Related papers (2024-05-30T07:38:58Z)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [87.61900472933523]
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation.
We scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data.
We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos.
arXiv Detail & Related papers (2024-01-19T18:59:52Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- Soft Labels for Rapid Satellite Object Detection [0.0]
We propose using satellite object detections as the basis for a new dataset of soft labels.
We show that soft labels can be used to train a model that is almost as accurate as a model trained on the original data.
arXiv Detail & Related papers (2022-12-01T15:23:13Z)
- Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
- Generative Models as a Data Source for Multiview Representation Learning [38.56447220165002]
Generative models are capable of producing realistic images that look nearly indistinguishable from the data on which they are trained.
This raises the question: if we have good enough generative models, do we still need datasets?
We investigate this question in the setting of learning general-purpose visual representations from a black-box generative model.
arXiv Detail & Related papers (2021-06-09T17:54:55Z)
- Detection and Segmentation of Custom Objects using High Distraction Photorealistic Synthetic Data [0.5076419064097732]
We show a straightforward and useful methodology for performing instance segmentation using synthetic data.
The goal is to achieve high performance on manually-gathered and annotated real-world data of custom objects.
This white paper provides strong evidence that photorealistic simulated data can be used in practical real-world applications.
arXiv Detail & Related papers (2020-07-28T16:33:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.