DataPro -- A Standardized Data Understanding and Processing Procedure: A Case Study of an Eco-Driving Project
- URL: http://arxiv.org/abs/2501.12176v1
- Date: Tue, 21 Jan 2025 14:34:11 GMT
- Title: DataPro -- A Standardized Data Understanding and Processing Procedure: A Case Study of an Eco-Driving Project
- Authors: Zhipeng Ma, Bo Nørregaard Jørgensen, Zheng Grace Ma
- Abstract summary: The CRISP-DM model is the de-facto standard for developing data-mining projects in practice.
This paper presents the DataPro model, which extends CRISP-DM and emphasizes the link between data scientists and stakeholders.
- Score: 0.9672182825841383
- Abstract: A systematic pipeline for data processing and knowledge discovery is essential to extracting knowledge from big data and making recommendations for operational decision-making. The CRISP-DM model is the de-facto standard for developing data-mining projects in practice. However, advancements in data processing technologies require enhancements to this framework. This paper presents the DataPro (a standardized data understanding and processing procedure) model, which extends CRISP-DM and emphasizes the link between data scientists and stakeholders by adding the "technical understanding" and "implementation" phases. Firstly, the "technical understanding" phase aligns business demands with technical requirements, ensuring the technical team's accurate comprehension of business goals. Next, the "implementation" phase focuses on the practical application of developed data science models, ensuring theoretical models are effectively applied in business contexts. Furthermore, clearly defining roles and responsibilities in each phase enhances management and communication among all participants. Afterward, a case study on an eco-driving data science project for fuel efficiency analysis in the Danish public transportation sector illustrates the application of the DataPro model. By following the proposed framework, the project identified key business objectives, translated them into technical requirements, and developed models that provided actionable insights for reducing fuel consumption. Finally, the model is evaluated qualitatively, demonstrating its superiority over other data science procedures.
Related papers
- YuLan-Mini: An Open Data-efficient Language Model [111.02822724500552]
YuLan-Mini, a highly capable base model with 2.42B parameters, achieves top-tier performance among models of similar parameter scale.
Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data.
arXiv Detail & Related papers (2024-12-23T17:47:53Z)
- A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z)
- Procedure Model for Building Knowledge Graphs for Industry Applications [0.0]
The graph-based integration of previously unconnected information with domain knowledge provides new insights.
This paper presents a practical step-by-step procedure model for building an RDF knowledge graph.
arXiv Detail & Related papers (2024-09-20T11:46:37Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Towards Avoiding the Data Mess: Industry Insights from Data Mesh Implementations [1.5029560229270191]
Data mesh is a socio-technical, decentralized, distributed concept for enterprise data management.
We conduct 15 semi-structured interviews with industry experts.
Our findings synthesize insights from industry experts and provide researchers and professionals with preliminary guidelines for the successful adoption of data mesh.
arXiv Detail & Related papers (2023-02-03T13:09:57Z)
- Process-BERT: A Framework for Representation Learning on Educational Process Data [68.8204255655161]
We propose a framework for learning representations of educational process data.
Our framework consists of a pre-training step that uses BERT-type objectives to learn representations from sequential process data.
We apply our framework to the 2019 Nation's Report Card Data Mining Competition dataset.
arXiv Detail & Related papers (2022-04-28T16:07:28Z)
- A survey study of success factors in data science projects [0.0]
The agile data science lifecycle is the most widely used framework, but only 25% of the survey participants report following a data science project methodology.
Professionals who adhere to a project methodology place greater emphasis on the project's potential risks and pitfalls.
arXiv Detail & Related papers (2022-01-17T09:50:46Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper, we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach, however, does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Data Science Methodologies: Current Challenges and Future Approaches [0.0]
A lack of vision and clear objectives, a biased emphasis on technical issues, and a low level of maturity for ad-hoc projects are among these challenges.
Few methodologies offer a complete guideline across team, project and data & information management.
We propose a conceptual framework containing general characteristics that a methodology for managing data science projects with a holistic point of view should have.
arXiv Detail & Related papers (2021-06-14T10:34:50Z)
- Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology [53.063411515511056]
We propose a process model for the development of machine learning applications.
The first phase combines business and data understanding, as data availability oftentimes affects the feasibility of the project.
The sixth phase covers state-of-the-art approaches for monitoring and maintenance of machine learning applications.
arXiv Detail & Related papers (2020-03-11T08:25:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.