Related papers: A Vision for Semantically Enriched Data Science

URL: http://arxiv.org/abs/2303.01378v1
Date: Thu, 2 Mar 2023 16:03:12 GMT
Title: A Vision for Semantically Enriched Data Science
Authors: Udayan Khurana, Kavitha Srinivas, Sainyam Galhotra, Horst Samulowitz
Abstract summary: Key areas such as utilizing domain knowledge and data semantics are areas where we have seen little automation. We envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation.
Score: 19.604667287258724
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The recent efforts in automation of machine learning or data science have achieved success in various tasks such as hyper-parameter optimization or model selection. However, key areas such as utilizing domain knowledge and data semantics are areas where we have seen little automation. Data Scientists have long leveraged common sense reasoning and domain knowledge to understand and enrich data for building predictive models. In this paper we discuss important shortcomings of current data science and machine learning solutions. We then envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation. Additionally, we discuss how semantics can assist data scientists in a new manner by helping with challenges related to trust, bias, and explainability in machine learning. Semantic annotation can also help better explore and organize large data sources.

Related papers

Autonomous Data Agents: A New Opportunity for Smart Data [50.02229219403014]
This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend.
arXiv Detail & Related papers (2025-09-23T06:46:41Z)
Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning [79.75718786477638]
We exploit the specialty of molecular tasks that there are physical laws connecting them, and design consistency training approaches. We demonstrate that the more accurate energy data can improve the accuracy of structure prediction. We also find that consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction.
arXiv Detail & Related papers (2024-10-14T03:11:33Z)
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z)
The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements. LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information. Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Interpretable Machine Learning for Discovery: Statistical Challenges & Opportunities [1.2891210250935146]
We discuss and review the field of interpretable machine learning. We outline the types of discoveries that can be made using Interpretable Machine Learning. We focus on the grand challenge of how to validate these discoveries in a data-driven manner.
arXiv Detail & Related papers (2023-08-02T23:57:31Z)
Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning. We first review methods for generating privacy-preserving graph data. Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
A Survey on Semantics in Automated Data Science [14.331183226753547]
Data Scientists leverage common sense reasoning and domain knowledge to understand and enrich data for building predictive models. We discuss how leveraging basic semantic reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation.
arXiv Detail & Related papers (2022-05-16T23:16:09Z)
Automating Data Science: Prospects and Challenges [30.4496620661692]
Automation in data science aims to facilitate and transform the work of data scientists, not to replace them. Important parts of data science are already being automated, especially in the modeling stages. Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.
arXiv Detail & Related papers (2021-05-12T14:34:35Z)
Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data. Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community. Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
Principles and Practice of Explainable Machine Learning [12.47276164048813]
This report focuses on data-driven methods -- machine learning (ML) and pattern recognition models in particular. With the increasing prevalence and complexity of methods, business stakeholders at the very least have a growing number of concerns about the drawbacks of models. We have undertaken a survey to help industry practitioners understand the field of explainable machine learning better.
arXiv Detail & Related papers (2020-09-18T14:50:27Z)
From Data to Knowledge to Action: A Global Enabler for the 21st Century [26.32590947516587]
A confluence of advances in the computer and mathematical sciences has unleashed unprecedented capabilities for enabling true evidence-based decision making. These capabilities are making possible the large-scale capture of data and the transformation of that data into insights and recommendations. The shift of commerce, science, education, art, and entertainment to the web makes available unprecedented quantities of structured and unstructured databases about human activities.
arXiv Detail & Related papers (2020-07-31T19:19:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.