Data-Centric AI Requires Rethinking Data Notion
- URL: http://arxiv.org/abs/2110.02491v2
- Date: Thu, 7 Oct 2021 06:37:07 GMT
- Title: Data-Centric AI Requires Rethinking Data Notion
- Authors: Mustafa Hajij, Ghada Zamzmi, Karthikeyan Natesan Ramamurthy, Aldo
Guzman Saenz
- Abstract summary: This work proposes unifying principles offered by categorical and cochain notions of data.
In the categorical notion, data is viewed as a mathematical structure that we act upon via morphisms to preserve this structure.
In the cochain notion, data is viewed as a function defined on a discrete domain of interest and acted upon via operators.
- Score: 12.595006823256687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The transition towards data-centric AI requires revisiting data notions from
mathematical and implementational standpoints to obtain unified data-centric
machine learning packages. Towards this end, this work proposes unifying
principles offered by categorical and cochain notions of data, and discusses
the importance of these principles in data-centric AI transition. In the
categorical notion, data is viewed as a mathematical structure that we act upon
via morphisms to preserve this structure. In the cochain notion, data is viewed
as a function defined on a discrete domain of interest and acted upon via
operators. While these notions are almost orthogonal, they provide a unifying
way to view data, ultimately impacting the way machine learning
packages are developed, implemented, and utilized by practitioners.
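To make the two notions concrete, the NumPy sketch below (illustrative only, not code from the paper; the toy graph and variable names are ours) treats the same vertex data first as a 0-cochain acted upon by the coboundary operator, and then as data attached to a graph structure acted upon by a structure-preserving morphism.
```python
# Minimal sketch (not from the paper) contrasting the two notions of data on a toy graph.
import numpy as np

# Discrete domain of interest: a graph with 4 vertices and 4 oriented edges.
vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# --- Cochain notion: data is a function on the domain, acted upon via operators.
# A 0-cochain assigns a value to each vertex.
f = np.array([1.0, 2.0, 4.0, 8.0])

# The coboundary operator d0 (signed incidence matrix) sends 0-cochains to
# 1-cochains: (d0 f)(u, v) = f(v) - f(u) for each oriented edge (u, v).
d0 = np.zeros((len(edges), len(vertices)))
for e, (u, v) in enumerate(edges):
    d0[e, u], d0[e, v] = -1.0, 1.0
edge_data = d0 @ f          # data acted upon by an operator
print(edge_data)            # [ 1.  2.  4. -7.]

# --- Categorical notion: data is a structure, acted upon via structure-preserving maps.
# A graph morphism phi maps vertices to vertices so that every edge lands on an edge.
phi = {0: 1, 1: 2, 2: 3, 3: 0}   # here: a rotation of the 4-cycle
assert all((phi[u], phi[v]) in edges for u, v in edges)

# Acting on the vertex data by pulling it back along the morphism keeps it a
# well-defined function on the same structure.
pullback_f = np.array([f[phi[v]] for v in vertices])
print(pullback_f)           # [2. 4. 8. 1.]
```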
Related papers
- Towards Data Valuation via Asymmetric Data Shapley [17.521840311921274]
We extend the traditional data Shapley framework to asymmetric data Shapley.
We introduce an efficient $k$-nearest neighbor-based algorithm for its exact computation.
We demonstrate the practical applicability of our framework across various machine learning tasks and data market contexts.
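For background, here is a minimal sketch of the classical symmetric KNN-Shapley recursion for an unweighted K-NN classifier (Jia et al., 2019), the kind of exact k-nearest-neighbor valuation that the asymmetric extension above is closely related to; it is not the paper's asymmetric algorithm, and the variable names are ours.
```python
# Hedged background sketch: classical *symmetric* KNN-Shapley for an unweighted
# K-NN classifier, not the asymmetric algorithm proposed in the paper above.
import numpy as np

def knn_shapley_single(x_train, y_train, x_test, y_test, K=5):
    """Exact Shapley values of the training points for one test point, with the
    K-NN classifier's probability of the correct label as the utility."""
    n = len(y_train)
    order = np.argsort(np.linalg.norm(x_train - x_test, axis=1))  # nearest first
    s = np.zeros(n)
    # Start from the farthest point and recurse towards the nearest point.
    s[order[-1]] = float(y_train[order[-1]] == y_test) / n
    for i in range(n - 2, -1, -1):  # i indexes the sorted order, 0 = nearest
        j, j_next = order[i], order[i + 1]
        s[j] = s[j_next] + (
            (float(y_train[j] == y_test) - float(y_train[j_next] == y_test))
            / K * min(K, i + 1) / (i + 1)
        )
    return s

# Toy usage with a single test point; in practice values are summed over a test set.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 2)), rng.integers(0, 2, size=20)
print(knn_shapley_single(X, y, np.zeros(2), 1, K=3))
```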
arXiv Detail & Related papers (2024-11-01T06:28:38Z)
- Prospector Heads: Generalized Feature Attribution for Large Models & Data [82.02696069543454]
We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods.
We demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data.
arXiv Detail & Related papers (2024-02-18T23:01:28Z)
- Surprisal Driven $k$-NN for Robust and Interpretable Nonparametric Learning [1.4293924404819704]
We shed new light on the traditional nearest neighbors algorithm from the perspective of information theory.
We propose a robust and interpretable framework for tasks such as classification, regression, density estimation, and anomaly detection using a single model.
Our work showcases the architecture's versatility by achieving state-of-the-art results in classification and anomaly detection.
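As a generic illustration of this information-theoretic reading (not the paper's formulation), the sketch below scores points by their surprisal, -log p, under a simple k-NN density estimate; high surprisal corresponds to low-density, potentially anomalous points.
```python
# Generic illustration only: surprisal from a k-NN density estimate, used as an anomaly score.
import numpy as np
from math import gamma, pi, log

def knn_surprisal(X, query, k=5):
    """Surprisal (in nats) of `query` under a k-NN density estimate on X."""
    n, d = X.shape
    r_k = np.sort(np.linalg.norm(X - query, axis=1))[k - 1]  # distance to k-th neighbor
    unit_ball_vol = pi ** (d / 2) / gamma(d / 2 + 1)
    density = k / (n * unit_ball_vol * r_k ** d)
    return -log(density)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(knn_surprisal(X, np.zeros(2)))    # low surprisal: dense region
print(knn_surprisal(X, np.full(2, 6)))  # high surprisal: likely outlier
```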
arXiv Detail & Related papers (2023-11-17T00:35:38Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Tackling Computational Heterogeneity in FL: A Few Theoretical Insights [68.8204255655161]
We introduce and analyse a novel aggregation framework that allows for formalizing and tackling computationally heterogeneous data.
The proposed aggregation algorithms are extensively analyzed from both theoretical and experimental perspectives.
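For context, the sketch below shows the standard FedAvg-style weighted aggregation that heterogeneity-aware schemes of this kind generalize; it is not the paper's proposed framework, and the function names are ours.
```python
# Context sketch (not the paper's framework): FedAvg-style weighted aggregation.
# Each client k returns updated parameters trained on n_k local samples.
from typing import Dict, List
import numpy as np

def fedavg(client_params: List[Dict[str, np.ndarray]],
           client_sizes: List[int]) -> Dict[str, np.ndarray]:
    """Weighted average of client parameters, weights proportional to local data size."""
    total = sum(client_sizes)
    keys = client_params[0].keys()
    return {
        key: sum((n / total) * params[key]
                 for params, n in zip(client_params, client_sizes))
        for key in keys
    }

# Toy usage: two clients, each holding a single weight matrix.
clients = [{"w": np.ones((2, 2))}, {"w": 3 * np.ones((2, 2))}]
print(fedavg(clients, client_sizes=[10, 30]))   # {'w': array of 2.5}
```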
arXiv Detail & Related papers (2023-07-12T16:28:21Z)
- Data-Centric Artificial Intelligence [2.5874041837241304]
Data-centric artificial intelligence (data-centric AI) represents an emerging paradigm emphasizing that the systematic design and engineering of data is essential for building effective and efficient AI-based systems.
We define relevant terms, provide key characteristics to contrast the data-centric paradigm to the model-centric one, and introduce a framework for data-centric AI.
arXiv Detail & Related papers (2022-12-22T16:41:03Z)
- Improved Representation Learning Through Tensorized Autoencoders [7.056005298953332]
Autoencoders (AE) are widely used in practice for unsupervised representation learning.
We propose a meta-algorithm that can be used to extend an arbitrary AE architecture to a tensorized version (TAE).
We prove that TAE can recover the principal components of the different clusters, in contrast to the principal components of the entire data recovered by a standard AE.
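The distinction can be illustrated directly with PCA on toy clustered data; the sketch below is not the TAE meta-algorithm itself, only a demonstration of why per-cluster and pooled principal components differ.
```python
# Illustration of the claim, not the TAE meta-algorithm: per-cluster principal
# components can differ sharply from those of the pooled data (what a standard
# linear AE recovers).
import numpy as np

def top_pc(X):
    """Leading principal component (unit vector) of the rows of X."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
# Two clusters, each elongated along the y-axis, but separated along the x-axis.
cluster_a = rng.normal(size=(200, 2)) * [0.2, 1.0] + [-5, 0]
cluster_b = rng.normal(size=(200, 2)) * [0.2, 1.0] + [+5, 0]
pooled = np.vstack([cluster_a, cluster_b])

print("global PC:   ", top_pc(pooled))     # ~[+-1, 0]: dominated by cluster separation
print("cluster A PC:", top_pc(cluster_a))  # ~[0, +-1]: within-cluster variation
print("cluster B PC:", top_pc(cluster_b))  # ~[0, +-1]
```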
arXiv Detail & Related papers (2022-12-02T09:29:48Z)
- Automatic Data Augmentation via Invariance-Constrained Learning [94.27081585149836]
Underlying data structures are often exploited to improve the solution of learning tasks.
Data augmentation induces these symmetries during training by applying multiple transformations to the input data.
This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
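A baseline sketch of the mechanism described above is given below: applying random transformations so that the averaged loss encourages invariance. This is not the paper's adaptive scheme; the rotation group and the linear scorer are illustrative choices.
```python
# Baseline illustration (not the paper's adaptive scheme): augmentation induces a
# symmetry by averaging the loss over random transformations of each input.
import numpy as np

rng = np.random.default_rng(0)

def rotate(x, theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ x

def augmented_loss(w, x, y, n_aug=8):
    """Squared loss averaged over random rotations of x; minimizing it pushes the
    linear scorer w towards rotation-invariant behaviour on this example."""
    losses = []
    for _ in range(n_aug):
        x_aug = rotate(x, rng.uniform(0, 2 * np.pi))
        losses.append((w @ x_aug - y) ** 2)
    return float(np.mean(losses))

w = rng.normal(size=2)
print(augmented_loss(w, x=np.array([1.0, 0.0]), y=1.0))
```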
arXiv Detail & Related papers (2022-09-29T18:11:01Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder in such a way that the transformation outcome is predictable by an auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.