Minimalist Data Wrangling with Python
- URL: http://arxiv.org/abs/2211.04630v1
- Date: Wed, 9 Nov 2022 01:24:39 GMT
- Title: Minimalist Data Wrangling with Python
- Authors: Marek Gagolewski
- Abstract summary: Data Wrangling with Python is envisaged as a student's first introduction to data science.
It provides a high-level overview as well as a detailed discussion of key concepts.
- Score: 4.429175633425273
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Minimalist Data Wrangling with Python is envisaged as a student's first
introduction to data science, providing a high-level overview as well as
discussing key concepts in detail. We explore methods for cleaning data
gathered from different sources, transforming, selecting, and extracting
features, performing exploratory data analysis and dimensionality reduction,
identifying naturally occurring data clusters, modelling patterns in data,
comparing data between groups, and reporting the results. This textbook is a
non-profit project. Its online and PDF versions are freely available at
https://datawranglingpy.gagolewski.com/.
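As an illustration of the kind of workflow the abstract enumerates, a typical wrangling pipeline cleans raw records, normalizes labels, imputes missing values, and summarizes by group. This is a minimal sketch using pandas; the data frame below is made up for illustration and does not come from the book.

```python
import pandas as pd

# Hypothetical raw records with missing values and inconsistent labels.
raw = pd.DataFrame({
    "group": ["a", "A", "b", "B", "a", None],
    "value": [1.0, 2.0, None, 4.0, 5.0, 6.0],
})

clean = (
    raw
    .dropna(subset=["group"])  # drop records with no group label
    .assign(
        group=lambda d: d["group"].str.lower(),          # normalize labels
        value=lambda d: d["value"].fillna(d["value"].mean()),  # impute mean
    )
)

# Exploratory summary: per-group means.
summary = clean.groupby("group")["value"].mean()
print(summary)
```

The method chain keeps each cleaning step explicit and avoids mutating the raw data, which makes the pipeline easy to audit step by step.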
Related papers
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- From Random to Informed Data Selection: A Diversity-Based Approach to Optimize Human Annotation and Few-Shot Learning [38.30983556062276]
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning.
Crowdsourcing introduces issues related to the annotator's experience, consistency, and biases.
This paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning.
arXiv Detail & Related papers (2024-01-24T04:57:32Z)
- arfpy: A python package for density estimation and generative modeling with adversarial random forests [1.3597551064547502]
This paper introduces arfpy, a Python implementation of Adversarial Random Forests (ARF) (Watson et al., 2023).
It is a lightweight procedure for synthesizing new data that resembles some given data.
arXiv Detail & Related papers (2023-11-13T14:28:21Z)
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks to select a subset of the training data, referred to as a coreset, so as to maximize the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
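The general idea behind reweighting training data to remove a spurious feature-label correlation can be sketched as follows. This is a generic illustration with made-up toy data, not the paper's optimization method: each example is weighted inversely to the frequency of its (feature, label) cell, so that under the reweighted distribution the feature carries no information about the label.

```python
from collections import Counter

# Toy data: a spurious binary feature paired with a binary label.
features = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]
labels   = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# Weight each example by 1 / count(feature, label) so that every
# (feature, label) cell carries equal total weight.
counts = Counter(zip(features, labels))
weights = [1.0 / counts[(f, y)] for f, y in zip(features, labels)]

def cond_p(feat_val):
    """Reweighted P(label = 1 | feature = feat_val)."""
    num = sum(w for f, y, w in zip(features, labels, weights)
              if f == feat_val and y == 1)
    den = sum(w for f, y, w in zip(features, labels, weights)
              if f == feat_val)
    return num / den

# After reweighting, the label is equally likely for both feature
# values, i.e. the spurious correlation has been cancelled.
print(cond_p(0), cond_p(1))  # both 0.5
```

The finding summarized above is that even when such reweighting succeeds on the data, models trained on the reweighted data can still exhibit the corresponding bias.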
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- PyTAIL: Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data [1.576409420083207]
PyTAIL is a Python library that supports a human-in-the-loop approach to actively training NLP models.
We simulate the performance of PyTAIL on existing social media benchmark datasets for text classification.
arXiv Detail & Related papers (2022-11-24T20:08:15Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performance comparable to that of a logistic model trained on the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on.
We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model.
Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
arXiv Detail & Related papers (2022-03-24T09:08:05Z)
- How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
- MusPy: A Toolkit for Symbolic Music Generation [32.01713268702699]
MusPy is an open source Python library for symbolic music generation.
In this paper, we present statistical analysis of the eleven datasets currently supported by MusPy.
arXiv Detail & Related papers (2020-08-05T06:16:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.