Fingerprinting and Building Large Reproducible Datasets
- URL: http://arxiv.org/abs/2306.11391v1
- Date: Tue, 20 Jun 2023 08:59:33 GMT
- Title: Fingerprinting and Building Large Reproducible Datasets
- Authors: Romain Lefeuvre, Jessie Galasso, Benoit Combemale, Houari Sahraoui and
Stefano Zacchiroli
- Abstract summary: We propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their provenance.
We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted.
- Score: 3.2873782624127843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Obtaining a relevant dataset is central to conducting empirical studies in
software engineering. However, in the context of mining software repositories,
the lack of appropriate tooling for large scale mining tasks hinders the
creation of new datasets. Moreover, limitations related to data sources that
change over time (e.g., code bases) and the lack of documentation of extraction
processes make it difficult to reproduce datasets over time. This threatens the
quality and reproducibility of empirical studies.
In this paper, we propose a tool-supported approach facilitating the creation
of large tailored datasets while ensuring their reproducibility. We leveraged
all the sources feeding the Software Heritage append-only archive which are
accessible through a unified programming interface to outline a reproducible
and generic extraction process. We propose a way to define a unique fingerprint
to characterize a dataset which, when provided to the extraction process,
ensures that the same dataset will be extracted.
We demonstrate the feasibility of our approach by implementing a prototype.
We show how it can help reduce the limitations researchers face when creating
or reproducing datasets.
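The abstract does not say what the fingerprint contains, so the following is only a minimal sketch of the general idea, not the authors' prototype. It assumes a fingerprint made of an archive cutoff timestamp plus selection criteria, and an append-only archive modelled as a list of records. The names `Fingerprint`, `extract`, `archive_cutoff`, `language`, and `min_commits` are hypothetical, as are the record fields. Because the archive only grows, ignoring every record added after the cutoff and sorting the result makes the extraction repeatable.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Fingerprint:
    # Hypothetical fields: the abstract does not specify the fingerprint's contents.
    archive_cutoff: str  # ISO 8601 UTC timestamp; later records are ignored
    language: str        # example selection criterion
    min_commits: int     # example selection criterion

    def digest(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) gives a stable hash.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def extract(archive, fp):
    """Return the origins selected by the fingerprint, in a deterministic order."""
    selected = [
        record["origin"]
        for record in archive
        # Same-format ISO 8601 UTC strings compare correctly as plain strings.
        if record["archived_at"] <= fp.archive_cutoff
        and record["language"] == fp.language
        and record["commits"] >= fp.min_commits
    ]
    # Sorting removes any dependence on the order records are stored in.
    return sorted(selected)
```

Under these assumptions, running `extract` with the same fingerprint gives the same list even after new records are appended to the archive, and `digest()` gives a short identifier that could be cited alongside a study.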
Related papers
- Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking [0.9944647907864256]
The imitation learning field requires expert data to train agents on a task.
Most often, this learning approach suffers from the absence of available data.
This work aims to address these issues by creating Imitation Learning datasets.
(2024-03-01T14:18:46Z)
- Dataset Factory: A Toolchain For Generative Computer Vision Datasets [0.9013233848500058]
We propose a "dataset factory" that separates the storage and processing of samples from metadata.
This enables data-centric operations at scale for machine learning teams and individual researchers.
(2023-09-20T19:43:37Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
(2023-09-19T17:18:36Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
(2023-05-24T12:15:19Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
It has become challenging to handle the unlimited growth of data with limited computing power.
Deep learning technology has developed unprecedentedly in the last decade.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
(2023-01-13T15:11:38Z)
- FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that, while generating synthetic data, most GANs amplify bias present in the training data, but that by removing these bias-inducing samples, GANs essentially focus more on real informative samples.
(2022-10-24T08:13:47Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
(2022-08-16T20:46:08Z)
- Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
(2022-05-12T17:03:57Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
(2021-07-01T09:26:13Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
(2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.