Principles for data analysis workflows
- URL: http://arxiv.org/abs/2007.08708v1
- Date: Fri, 17 Jul 2020 01:17:37 GMT
- Title: Principles for data analysis workflows
- Authors: Sara Stoudt, Valeri N. Vasquez, Ciera C. Martinez
- Abstract summary: We elaborate basic principles of a reproducible data analysis workflow by defining three phases: the Exploratory, Refinement, and Polishing Phases.
We draw analogies between principles for data-intensive research and established practice in software development.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional data science education often omits training on research
workflows: the process that moves a scientific investigation from raw data to
coherent research question to insightful contribution. In this paper, we
elaborate basic principles of a reproducible data analysis workflow by defining
three phases: the Exploratory, Refinement, and Polishing Phases. Each workflow
phase is roughly centered around the audience to whom research decisions,
methodologies, and results are being immediately communicated. Importantly,
each phase can also give rise to a number of research products beyond
traditional academic publications. Where relevant, we draw analogies between
principles for data-intensive research workflows and established practice in
software development. The guidance provided here is not intended to be a strict
rulebook; rather, the suggestions for practices and tools to advance
reproducible, sound data-intensive analysis may furnish support for both
students and current professionals.
Related papers
- Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them. This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks. We introduce a task-centric taxonomy that organizes the field into major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z) - Best Practices For Empirical Meta-Algorithmic Research: Guidelines from the COSEAL Research Network [46.56867772369597]
Best practices for meta-algorithmic research exist, but they are scattered between different publications and fields. This report collects good practices for empirical meta-algorithmic research across the subfields of the COSEAL community. It establishes the current state-of-the-art practices within meta-algorithmic research and serves as a guideline to both new researchers and practitioners in meta-algorithmic fields.
arXiv Detail & Related papers (2025-12-18T12:59:45Z) - Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework [7.681506465886571]
We propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. We use Flan-T5 with prompt learning to generate workflow phrases from paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. This approach reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies.
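As an illustration (not code from the paper itself), the ROUGE-1 score reported above measures unigram overlap between a generated phrase and a reference. A minimal sketch of ROUGE-1 F1 using simple whitespace tokenization:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall,
    where overlap counts clipped matches between the two word multisets."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Production implementations (e.g., the `rouge-score` package) add stemming and more careful tokenization, which this sketch omits.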
arXiv Detail & Related papers (2025-09-16T10:59:23Z) - Large Language Models in the Data Science Lifecycle: A Systematic Mapping Study [0.0]
Large Language Models (LLMs) have emerged as transformative tools across numerous domains. This systematic mapping study comprehensively examines the application of LLMs throughout the Data Science lifecycle.
arXiv Detail & Related papers (2025-08-12T23:20:10Z) - A Comprehensive Survey on Imbalanced Data Learning [56.65067795190842]
Imbalanced data is prevalent in various types of raw data and hinders the performance of machine learning. This survey systematically analyzes various real-world data formats. It categorizes existing research on different data formats into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
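To make the first category concrete, here is a hedged sketch (not from the survey) of the simplest data re-balancing strategy, random oversampling, which duplicates minority-class samples until all classes match the majority-class count:

```python
import random

def random_oversample(samples, labels, seed=0):
    """Re-balance a dataset by duplicating randomly chosen
    minority-class samples until every class reaches the
    majority-class size. Returns new (samples, labels) lists."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y
```

More sophisticated re-balancing methods (e.g., SMOTE) synthesize new minority samples rather than duplicating existing ones.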
arXiv Detail & Related papers (2025-02-13T04:53:17Z) - Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z) - DISCOVER: A Data-driven Interactive System for Comprehensive Observation, Visualization, and ExploRation of Human Behaviour [6.716560115378451]
We introduce a modular, flexible, yet user-friendly software framework specifically developed to streamline computationally driven data exploration for human behavior analysis.
Our primary objective is to democratize access to advanced computational methodologies, thereby enabling researchers across disciplines to engage in detailed behavioral analysis without the need for extensive technical proficiency.
arXiv Detail & Related papers (2024-07-18T11:28:52Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Everywhere & Nowhere: Envisioning a Computing Continuum for Science [21.111766975909752]
Emerging data-driven scientific workflows are seeking to leverage distributed data sources to understand end-to-end phenomena, drive experimentation, and facilitate important decision-making.
This paper explores a computing continuum that is everywhere and nowhere -- one spanning resources at the edges, in the core, and in between, and providing abstractions that can be harnessed to support science.
It also introduces recent research in programming abstractions that can express what data should be processed and when and where it should be processed, and autonomic services that automate the discovery of resources and the orchestration of computations across these resources.
arXiv Detail & Related papers (2024-06-06T20:07:31Z) - Toward Unified Practices in Trajectory Prediction Research on Drone Datasets [3.1406146587437904]
The availability of high-quality datasets is crucial for the development of behavior prediction algorithms in autonomous vehicles.
This paper highlights the need to standardize the use of certain datasets for motion forecasting research.
We propose a set of tools and practices to achieve this.
arXiv Detail & Related papers (2024-05-01T16:17:39Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
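As a concrete illustration (not taken from the survey), the most basic image augmentation algorithms generate label-preserving variants of a sample, such as horizontal and vertical flips of a 2D grid:

```python
def augment_flips(image):
    """Return the original 2D grid (nested lists) plus its
    horizontal and vertical flips: three label-preserving variants."""
    hflip = [row[::-1] for row in image]  # reverse each row
    vflip = image[::-1]                   # reverse row order
    return [image, hflip, vflip]
```

Real augmentation pipelines (rotations, crops, color jitter, mixup) follow the same pattern of deriving new training samples from existing ones.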
arXiv Detail & Related papers (2022-07-18T11:38:32Z) - A Field Guide to Federated Optimization [161.3779046812383]
Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data.
This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms.
arXiv Detail & Related papers (2021-07-14T18:09:08Z) - Deep Learning Schema-based Event Extraction: Literature Review and Current Trends [60.29289298349322]
Event extraction technology based on deep learning has become a research hotspot.
This paper fills the gap by reviewing the state-of-the-art approaches, focusing on deep learning-based models.
arXiv Detail & Related papers (2021-07-05T16:32:45Z) - Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z) - Data Vision: Learning to See Through Algorithmic Abstraction [6.730787776951012]
Learning to see through data is central to contemporary forms of algorithmic knowledge production.
This paper examines how the often-divergent demands of mechanization and discretion manifest in data analytic learning environments.
arXiv Detail & Related papers (2020-02-09T15:46:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.