Measuring Data
- URL: http://arxiv.org/abs/2212.05129v1
- Date: Fri, 9 Dec 2022 22:10:46 GMT
- Title: Measuring Data
- Authors: Margaret Mitchell and Alexandra Sasha Luccioni and Nathan Lambert and
Marissa Gerchick and Angelina McMillan-Major and Ezinwanne Ozoani and Nazneen
Rajani and Tristan Thrush and Yacine Jernite and Douwe Kiela
- Abstract summary: We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets.
Data measurements quantify different attributes of data along common dimensions that support comparison.
We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
- Score: 79.89948814583805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We identify the task of measuring data to quantitatively characterize the
composition of machine learning data and datasets. Similar to an object's
height, width, and volume, data measurements quantify different attributes of
data along common dimensions that support comparison. Several lines of research
have proposed what we refer to as measurements, with differing terminology; we
bring some of this work together, particularly in fields of computer vision and
language, and build from it to motivate measuring data as a critical component
of responsible AI development. Measuring data aids in systematically building
and analyzing machine learning (ML) data towards specific goals and gaining
better control of what modern ML systems will learn. We conclude with a
discussion of the many avenues of future work, the limitations of data
measurements, and how to leverage these measurement approaches in research and
practice.
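As a concrete illustration of the idea, the sketch below computes a handful of simple measurements over a toy text dataset along common dimensions (instance count, length statistics, vocabulary size, and label balance). The specific metrics and the function name measure_text_dataset are illustrative assumptions, not the measurement suite defined in the paper.

```python
from collections import Counter
import statistics


def measure_text_dataset(texts, labels):
    """Toy data measurements along a few common dimensions (illustrative only)."""
    # Naive whitespace tokenization; real measurement pipelines would use a proper tokenizer.
    tokens_per_text = [t.split() for t in texts]
    lengths = [len(toks) for toks in tokens_per_text]
    vocab = {tok for toks in tokens_per_text for tok in toks}
    label_counts = Counter(labels)
    return {
        "num_instances": len(texts),
        "mean_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
        "vocab_size": len(vocab),
        # Fraction of instances per label: a simple view of dataset composition.
        "label_balance": {label: count / len(labels) for label, count in label_counts.items()},
    }


if __name__ == "__main__":
    texts = ["the cat sat on the mat", "dogs bark loudly", "what a lovely day"]
    labels = ["animal", "animal", "weather"]
    print(measure_text_dataset(texts, labels))
```

Because the same quantities can be computed for any dataset of the same modality (for example, length distributions before and after filtering), they support the kind of side-by-side comparison the abstract describes.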
Related papers
- Data-driven Modeling in Metrology -- A Short Introduction, Current Developments and Future Perspectives [3.5840407154326224]
Digital technology, expansive sensor networks, and high-performance computing have led to a growing shift towards data-driven methodologies.
Here, we demonstrate the variety of opportunities that data-driven modeling presents, and how they have already been implemented in various real-world applications.
arXiv Detail & Related papers (2024-06-24T14:09:45Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis [128.0532113800092]
We present a mechanistic interpretation of Transformer-based LMs on arithmetic questions.
This provides insights into how information related to arithmetic is processed by LMs.
arXiv Detail & Related papers (2023-05-24T11:43:47Z)
- Estimating informativeness of samples with Smooth Unique Information [108.25192785062367]
We measure how much a sample informs the final weights and how much it informs the function computed by the weights.
We give efficient approximations of these quantities using a linearized network.
We apply these measures to several problems, such as dataset summarization.
arXiv Detail & Related papers (2021-01-17T10:29:29Z)
- Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data [0.15229257192293197]
We propose two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset.
We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data.
arXiv Detail & Related papers (2021-01-05T10:23:08Z)
- Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)
- On the Use of Interpretable Machine Learning for the Management of Data Quality [13.075880857448059]
We propose using interpretable machine learning to identify the features on which any data processing activity should be based.
Our aim is to secure data quality, at least for those features detected as significant in the collected datasets.
arXiv Detail & Related papers (2020-07-29T08:49:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.