A survey study of success factors in data science projects
- URL: http://arxiv.org/abs/2201.06310v1
- Date: Mon, 17 Jan 2022 09:50:46 GMT
- Title: A survey study of success factors in data science projects
- Authors: Iñigo Martinez, Elisabeth Viles, Igor G. Olaizola
- Abstract summary: Agile data science lifecycle is the most widely used framework, but only 25% of the survey participants report following a data science project methodology.
Professionals who adhere to a project methodology place greater emphasis on the project's potential risks and pitfalls.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the data science community has pursued excellence and made
significant research efforts to develop advanced analytics, focusing on solving
technical problems at the expense of organizational and socio-technical
challenges. According to previous surveys on the state of data science project
management, there is a significant gap between technical and organizational
processes. In this article we present new empirical data from a survey of 237
data science professionals on the use of project management methodologies for
data science. We provide additional profiling of the survey respondents' roles
and their priorities when executing data science projects. Based on this survey
study, the main findings are: (1) Agile data science lifecycle is the most
widely used framework, but only 25% of the survey participants report following
a data science project methodology. (2) The most important success factors are
precisely describing stakeholders' needs, communicating the results to
end-users, and team collaboration and coordination. (3) Professionals who
adhere to a project methodology place greater emphasis on the project's
potential risks and pitfalls, version control, the deployment pipeline to
production, and data security and privacy.
Related papers
- DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z) - Research information in the light of artificial intelligence: quality and data ecologies [0.0]
This paper presents multi- and interdisciplinary approaches for finding the appropriate AI technologies for research information.
Professional research information management (RIM) is becoming increasingly important as an expressly data-driven tool for researchers.
arXiv Detail & Related papers (2024-05-06T16:07:56Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Why Data Science Projects Fail [0.0]
Data Science is at the core of many businesses and helps them build smart strategies to deal with business challenges more efficiently.
Data Science practice also helps automate business processes using algorithms, and it offers several other benefits, which it can also deliver in non-profit settings.
With regard to data science, three key components primarily influence the effective outcome of a data science project.
arXiv Detail & Related papers (2023-08-08T06:45:15Z) - Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z) - TAPS Responsibility Matrix: A tool for responsible data science by design [2.2973034509761816]
We describe the Transparency, Accountability, Privacy, and Societal Responsibility Matrix (TAPS-RM) as a framework to explore social, legal, and ethical aspects of data science projects.
We map the developed model of TAPS-RM with well-known initiatives for open data.
We conclude that TAPS-RM is a tool to reflect on responsibilities at a data science project level and can be used to advance responsible data science by design.
arXiv Detail & Related papers (2023-02-02T12:09:14Z) - Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z) - Data Science Methodologies: Current Challenges and Future Approaches [0.0]
Lack of vision and clear objectives, a biased emphasis on technical issues, and a low level of maturity for ad hoc projects are among the challenges facing current data science methodologies.
Few methodologies offer a complete set of guidelines across team, project, and data & information management.
We propose a conceptual framework containing general characteristics that a methodology for managing data science projects with a holistic point of view should have.
arXiv Detail & Related papers (2021-06-14T10:34:50Z) - Trust in Data Science: Collaboration, Translation, and Accountability in Corporate Data Science Projects [6.730787776951012]
We describe four common tensions in applied data science work: (un)equivocal numbers, (counter)intuitive knowledge, (in)credible data, and (in)scrutable models.
We show how organizational actors establish and re-negotiate trust under messy and uncertain analytic conditions through practices of skepticism, assessment, and credibility.
We conclude by discussing the implications of our findings for data science research and practice, both within and beyond CSCW.
arXiv Detail & Related papers (2020-02-09T15:50:50Z)