Lightweight Knowledge Representations for Automating Data Analysis
- URL: http://arxiv.org/abs/2311.12848v1
- Date: Sun, 15 Oct 2023 06:44:45 GMT
- Title: Lightweight Knowledge Representations for Automating Data Analysis
- Authors: Marko Sterbentz, Cameron Barrie, Donna Hooshmand, Shubham Shahi,
Abhratanu Dutta, Harper Pack, Andong Li Zhao, Andrew Paley, Alexander
Einarsson, Kristian Hammond
- Abstract summary: We take the first steps towards automating a key aspect of the data science pipeline: data analysis.
We present a taxonomy of data analytic operations that scopes analytics across domains and data, as well as a method for codifying domain-specific knowledge that links this taxonomy to actual data.
In this way, we produce information spaces over data that enable complex analyses and search over this data and pave the way for fully automated data analysis.
- Score: 33.094930396228676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The principal goal of data science is to derive meaningful information from
data. To do this, data scientists develop a space of analytic possibilities and
from it reach their information goals by using their knowledge of the domain,
the available data, the operations that can be performed on those data, the
algorithms/models that are fed the data, and how all of these facets
interweave. In this work, we take the first steps towards automating a key
aspect of the data science pipeline: data analysis. We present an extensible
taxonomy of data analytic operations that scopes across domains and data, as
well as a method for codifying domain-specific knowledge that links this
analytics taxonomy to actual data. We validate the functionality of our
analytics taxonomy by implementing a system that leverages it, alongside domain
labelings for 8 distinct domains, to automatically generate a space of
answerable questions and associated analytic plans. In this way, we produce
information spaces over data that enable complex analyses and search over this
data and pave the way for fully automated data analysis.
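The abstract's central idea, linking a taxonomy of analytic operations to labeled data in order to enumerate answerable questions and their analytic plans, can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the authors' system: the `Operation` class, the three operations in `TAXONOMY`, the `DOMAIN_LABELS` mapping, and the question templates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Operation:
    """A hypothetical taxonomy entry: an operation and the kinds of data it accepts."""
    name: str
    input_kinds: Tuple[str, ...]
    template: str


# Assumed toy taxonomy; the paper's actual taxonomy is not reproduced here.
TAXONOMY = [
    Operation("mean", ("numeric",), "What is the average {0}?"),
    Operation("count_by", ("categorical",), "How many records are there per {0}?"),
    Operation("mean_by", ("numeric", "categorical"),
              "What is the average {0} for each {1}?"),
]

# Assumed domain labeling: maps each column of a made-up dataset to a data kind.
DOMAIN_LABELS: Dict[str, str] = {
    "price": "numeric",
    "bedrooms": "numeric",
    "city": "categorical",
}


def generate_questions(taxonomy: List[Operation], labels: Dict[str, str]):
    """Pair every operation with every compatible combination of labeled columns."""
    results = []
    for op in taxonomy:
        # For each input slot, collect the columns whose label matches the slot's kind.
        candidates = [
            [col for col, kind in labels.items() if kind == wanted]
            for wanted in op.input_kinds
        ]
        for cols in product(*candidates):
            if len(set(cols)) < len(cols):
                continue  # skip plans that reuse the same column twice
            plan = (op.name, cols)
            results.append((op.template.format(*cols), plan))
    return results


questions = generate_questions(TAXONOMY, DOMAIN_LABELS)
for text, plan in questions:
    print(text, "->", plan)
```

In this sketch the "information space" is simply the list of (question, plan) pairs; adding an operation to the taxonomy or a label to the domain mapping enlarges that space without touching the generator.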
Related papers
- Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs [0.061446808540639365]
This work explores the use of Knowledge Graphs (KGs) as a basic framework for capturing complex analytics in a human-centered manner.
The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems.
arXiv Detail & Related papers (2024-11-01T20:45:23Z)
- Empowering Data Mesh with Federated Learning [5.087058648342379]
A new paradigm, Data Mesh, treats domains as a first-class concern by distributing data ownership from the central team to each data domain.
Many multi-million dollar organizations like PayPal, Netflix, and Zalando have already transformed their data analysis pipelines based on this new architecture.
We introduce a pioneering approach that incorporates Federated Learning into Data Mesh.
arXiv Detail & Related papers (2024-03-26T17:10:15Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Demonstration of InsightPilot: An LLM-Empowered Automated Data Exploration System [48.62158108517576]
We introduce InsightPilot, an automated data exploration system designed to simplify the data exploration process.
InsightPilot automatically selects appropriate analysis intents, such as understanding, summarizing, and explaining.
In brief, an IQuery is an abstraction and automation of data analysis operations, which mimics the approach of data analysts.
arXiv Detail & Related papers (2023-04-02T07:27:49Z)
- PADME-SoSci: A Platform for Analytics and Distributed Machine Learning for the Social Sciences [4.294774517325059]
PADME is a distributed analytics tool that federates model implementation and training.
It enables the analysis of data across locations while still allowing the model to be trained as if all data were in a single location.
arXiv Detail & Related papers (2023-03-27T15:32:35Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 of their transformed versions.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
- Paradigm selection for Data Fusion of SAR and Multispectral Sentinel data applied to Land-Cover Classification [63.072664304695465]
In this letter, four data fusion paradigms based on Convolutional Neural Networks (CNNs) are analyzed and implemented.
The goals are to provide a systematic procedure for choosing the best data fusion framework, resulting in the best classification results.
The procedure has been validated for land-cover classification but it can be transferred to other cases.
arXiv Detail & Related papers (2021-06-18T11:36:54Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Towards an Integrated Platform for Big Data Analysis [4.5257812998381315]
This paper presents the vision of an integrated platform for big data analysis that combines all these aspects.
The main benefits of this approach are enhanced scalability of the whole platform, better parameterization of algorithms, and improved usability during the end-to-end data analysis process.
arXiv Detail & Related papers (2020-04-27T03:15:23Z)