Related papers: Data Science: a Natural Ecosystem

URL: http://arxiv.org/abs/2506.11010v1
Date: Fri, 25 Apr 2025 08:43:27 GMT
Title: Data Science: a Natural Ecosystem
Authors: Emilio Porcu, Roy El Moukari, Laurent Najman, Francisco Herrera, Horst Simon
Abstract summary: This manuscript provides a holistic (data-centric) view of what we term essential data science. Data scientists face challenges that are defined according to the missions. We semantically split the essential data science into computational, and foundational.
Score: 8.870389904165705
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This manuscript provides a holistic (data-centric) view of what we term essential data science, as a natural ecosystem with challenges and missions stemming from the data universe with its multiple combinations of the 5D complexities (data structure, domain, cardinality, causality, and ethics) with the phases of the data life cycle. Data agents perform tasks driven by specific goals. The data scientist is an abstract entity that comes from the logical organization of data agents with their actions. Data scientists face challenges that are defined according to the missions. We define specific discipline-induced data science, which in turn allows for the definition of pan-data science, a natural ecosystem that integrates specific disciplines with the essential data science. We semantically split the essential data science into computational, and foundational. We claim that there is a serious threat of divergence between computational and foundational data science, especially if no approach is taken to rate whether a data universe discovery should be useful or not. We suggest that rigorous approaches to measure the usefulness of data universe discoveries might mitigate such a divergence.

Related papers

WildSci: Advancing Scientific Reasoning from In-the-Wild Literature [50.16160754134139]
We introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach.
arXiv Detail & Related papers (2026-01-09T06:35:23Z)
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
A Self-Evolving AI Agent System for Climate Science [59.08800209508371]
We introduce EarthLink, the first self-evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning. It exhibits human-like cross-disciplinary analytical ability and proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks.
arXiv Detail & Related papers (2025-07-23T08:29:25Z)
Foundation Models for Spatio-Temporal Data Science: A Tutorial and Survey [69.0648659029394]
Spatio-Temporal (ST) data science is fundamental to understanding complex systems in domains such as urban computing, climate science, and intelligent transportation. Researchers have begun exploring the concept of Spatio-Temporal Foundation Models (STFMs) to enhance adaptability and generalization across diverse ST tasks. STFMs empower the entire workflow of ST data science, ranging from data sensing, management, to mining, thereby offering a more holistic and scalable approach.
arXiv Detail & Related papers (2025-03-12T09:42:18Z)
Building Machine Learning Challenges for Anomaly Detection in Science [94.24422981343699]
We present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains. We present a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable.
arXiv Detail & Related papers (2025-03-03T22:54:07Z)
Causal Representation Learning in Temporal Data via Single-Parent Decoding [66.34294989334728]
Scientific research often seeks to understand the causal structure underlying high-level variables in a system. Scientists typically collect low-level measurements, such as geographically distributed temperature readings. We propose a differentiable method, Causal Discovery with Single-parent Decoding, that simultaneously learns the underlying latents and a causal graph over them.
arXiv Detail & Related papers (2024-10-09T15:57:50Z)
DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z)
The Future of Data Science Education [0.11566458078238004]
The School of Data Science at the University of Virginia has developed a novel model for the definition of Data Science. This paper will present the core features of the model and explain how it unifies various concepts going far beyond the analytics component of AI.
arXiv Detail & Related papers (2024-07-16T15:11:54Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
A data science axiology: the nature, value, and risks of data science [0.0]
Data science is a research paradigm with an unfathomed scope, scale, complexity, and power for knowledge discovery. This paper presents an axiology of data science, its purpose, nature, importance, risks, and value for problem solving.
arXiv Detail & Related papers (2023-07-19T21:12:04Z)
Defining data science: a new field of inquiry [0.0]
Modern data science is in its infancy. Emerging slowly since 1962 and rapidly since 2000, it is one of the most active, powerful, and rapidly evolving 21st century innovations. Due to its value, power, and applicability, it is emerging in over 40 disciplines, hundreds of research areas, and thousands of applications. This research addresses the challenge of multiple data science definitions by proposing the development of a coherent, unified definition based on a data science reference framework.
arXiv Detail & Related papers (2023-06-28T12:58:42Z)
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.28944613907541]
Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
Position Paper on Dataset Engineering to Accelerate Science [1.952708415083428]
In this work, we will use the token "dataset" to designate a structured set of data built to perform a well-defined task. Specifically, in science, each area has unique forms to organize, gather and handle its datasets. We advocate that science and engineering discovery processes are extreme instances of the need for such organization on datasets.
arXiv Detail & Related papers (2023-03-09T19:07:40Z)
Modeling Information Change in Science Communication with Semantically Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change. SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers. Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
arXiv Detail & Related papers (2022-10-24T07:44:38Z)
Data Science: Challenges and Directions [42.98602883069444]
We review hundreds of pieces of literature which include data science in their titles. We find that the majority of the discussions essentially concern statistics, data mining, machine learning, big data, or broadly data analytics. We focus on the research and innovation challenges inspired by the nature of data science problems as complex systems.
arXiv Detail & Related papers (2020-06-28T01:49:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.