What makes an Expert? Comparing Problem-solving Practices in Data Science Notebooks
- URL: http://arxiv.org/abs/2602.15428v1
- Date: Tue, 17 Feb 2026 08:45:23 GMT
- Title: What makes an Expert? Comparing Problem-solving Practices in Data Science Notebooks
- Authors: Manuel Valle Torre, Marcus Specht, Catharine Oertel
- Abstract summary: Development of data science expertise requires tacit, process-oriented skills that are difficult to teach directly. This study addresses the resulting challenge of empirically understanding how the problem-solving processes of experts and novices differ.
- Score: 0.6308539010172308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of data science expertise requires tacit, process-oriented skills that are difficult to teach directly. This study addresses the resulting challenge of empirically understanding how the problem-solving processes of experts and novices differ. We apply a multi-level sequence analysis to 440 Jupyter notebooks from a public dataset, mapping low-level coding actions to higher-level problem-solving practices. Our findings reveal that experts do not follow fundamentally different transitions between data science phases (e.g., Data Import, EDA, Model Training, Visualization) than novices. Instead, expertise is distinguished by the overall workflow structure from a problem-solving perspective and by fine-grained, cell-level action patterns. Novices tend to follow long, linear processes, whereas experts employ shorter, more iterative strategies enacted through efficient, context-specific action sequences. These results provide data science educators with empirical insights for curriculum design and assessment, shifting the focus from final products toward the development of the flexible, iterative thinking that defines expertise: a priority in a field increasingly shaped by AI tools.
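The abstract describes a sequence analysis over phase-labeled notebook cells. As a minimal sketch of the kind of first-order transition analysis such a study might use, the snippet below counts and normalizes transitions between consecutive phase labels. The phase names and the example sequence are illustrative assumptions; the paper's actual taxonomy and labeling pipeline are not reproduced here.

```python
from collections import Counter, defaultdict

def transition_probabilities(phase_sequence):
    """Estimate first-order transition probabilities between
    consecutive phase labels in one notebook's cell sequence."""
    # Count each (source, destination) pair of adjacent labels.
    counts = Counter(zip(phase_sequence, phase_sequence[1:]))
    # Row-normalize: divide each count by the source phase's total.
    totals = defaultdict(int)
    for (src, _), c in counts.items():
        totals[src] += c
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}

# Hypothetical short, linear "novice-like" workflow.
novice = ["import", "eda", "eda", "model_training", "visualization"]
probs = transition_probabilities(novice)
```

On this toy sequence, half of the transitions out of "eda" stay in "eda" and half move to "model_training"; comparing such matrices across expertise groups is one standard way to quantify the iterative-versus-linear distinction the abstract reports.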
Related papers
- Transforming Behavioral Neuroscience Discovery with In-Context Learning and AI-Enhanced Tensor Methods [5.319819085855185]
We showcase an example AI-enhanced pipeline designed to transform and accelerate the way that the domain experts in the team are able to gain insights out of experimental data. The application at hand is in the domain of behavioral neuroscience, studying fear generalization in mice. We identify the emerging paradigm of "In-Context Learning" (ICL) as a suitable interface for domain experts to automate parts of their pipeline without the need for or familiarity with AI model training and fine-tuning.
arXiv Detail & Related papers (2026-02-19T02:47:46Z) - Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation [18.99847259801634]
We propose Reinforcement Learning from Augmented Generation (RLAG) to embed domain knowledge into large language models. Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches.
arXiv Detail & Related papers (2025-09-24T14:30:16Z) - DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation [59.79833777420334]
Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. We develop a novel inference-time optimization framework, referred to as DSMentor, to enhance LLM agent performance. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference.
arXiv Detail & Related papers (2025-05-20T10:16:21Z) - Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding [52.723297744257536]
Pre-trained language models (LMs) have shown effectiveness in scientific literature understanding tasks.
We propose a multi-task contrastive learning framework, SciMult, to facilitate common knowledge sharing across different literature understanding tasks.
arXiv Detail & Related papers (2023-05-23T16:47:22Z) - Experts in the Loop: Conditional Variable Selection for Accelerating Post-Silicon Analysis Based on Deep Learning [6.6357750579293935]
Post-silicon validation is one of the most critical processes in semiconductor manufacturing.
This work aims to design a novel conditional variable selection approach while keeping experts in the loop.
arXiv Detail & Related papers (2022-09-30T06:12:12Z) - Decision Rule Elicitation for Domain Adaptation [93.02675868486932]
Human-in-the-loop machine learning is widely used in artificial intelligence (AI) to elicit labels from experts.
In this work, we allow experts to additionally produce decision rules describing their decision-making.
We show that decision rule elicitation improves domain adaptation of the algorithm and helps to propagate experts' knowledge to the AI model.
arXiv Detail & Related papers (2021-02-23T08:07:22Z) - Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z) - Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, still remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z) - Principles and Practice of Explainable Machine Learning [12.47276164048813]
This report focuses on data-driven methods -- machine learning (ML) and pattern recognition models in particular.
With the increasing prevalence and complexity of methods, business stakeholders, at the very least, have a growing number of concerns about the drawbacks of models.
We have undertaken a survey to help industry practitioners understand the field of explainable machine learning better.
arXiv Detail & Related papers (2020-09-18T14:50:27Z) - A Review of Meta-level Learning in the Context of Multi-component, Multi-level Evolving Prediction Systems [6.810856082577402]
The exponential growth of the volume, variety, and velocity of data raises the need to investigate automated or semi-automated ways to extract useful patterns from the data.
It requires deep expert knowledge and extensive computational resources to find the most appropriate mapping of learning methods for a given problem.
There is a need for an intelligent recommendation engine that can advise on the best learning algorithm for a given dataset.
arXiv Detail & Related papers (2020-07-17T14:14:37Z) - Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen [88.30492014778943]
We propose a new task of expertise style transfer and contribute a manually annotated dataset.
Solving this task not only simplifies the professional language, but also improves the accuracy and expertise level of laymen descriptions.
We establish the benchmark performance of five state-of-the-art models for style transfer and text simplification.
arXiv Detail & Related papers (2020-05-02T04:50:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.