The DataSquad Experiment: Lessons for Preparing Data and Computer Scientists for Work
- URL: http://arxiv.org/abs/2511.19688v1
- Date: Mon, 24 Nov 2025 20:49:28 GMT
- Title: The DataSquad Experiment: Lessons for Preparing Data and Computer Scientists for Work
- Authors: Paula Lackie, Elliot Pickens, Dashiell Coyier,
- Abstract summary: The DataSquad at Carleton College trains undergraduates through structured peer mentorship and real client projects.
This paper describes the program's implementation at Carleton College and examines how structured peer mentorship can simultaneously improve institutional data services and provide students with professional skills and confidence.
- Score: 0.746464933382582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The DataSquad at Carleton College addresses a common problem at small liberal arts colleges: limited capacity for data services and few opportunities for students to gain practical experience with data and software development. Academic Technologist Paula Lackie designed the program as a work-study position that trains undergraduates through structured peer mentorship and real client projects. Students tackle data problems of increasing complexity, from basic data analysis to software development, while learning FAIR data principles and open science practices. The model's core components (peer mentorship structure, project-based learning, and communication training) make it adaptable to other institutions. UCLA and other colleges have adopted the model using openly shared materials through "DataSquad International." This paper describes the program's implementation at Carleton College and examines how structured peer mentorship can simultaneously improve institutional data services and provide students with professional skills and confidence.
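As a concrete illustration of the FAIR data principles mentioned in the abstract, the sketch below shows the kind of minimal, machine-readable dataset description a student project might produce; the field names and values are hypothetical and are not taken from the DataSquad curriculum.

```python
# Hypothetical dataset record illustrating the FAIR principles:
# Findable (persistent identifier), Accessible (standard retrieval protocol),
# Interoperable (open format, documented variables), Reusable (license, provenance).
import json

dataset_record = {
    "identifier": "doi:10.0000/example-dataset",   # placeholder DOI -> Findable
    "title": "Example survey data (illustrative only)",
    "access_url": "https://example.org/data.csv",  # HTTPS retrieval -> Accessible
    "format": "text/csv",                          # open format -> Interoperable
    "variables": [{"name": "age", "unit": "years"}],
    "license": "CC-BY-4.0",                        # explicit license -> Reusable
    "provenance": "Collected 2024; cleaned with a version-controlled script",
}

print(json.dumps(dataset_record, indent=2))
```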
Related papers
- Learning to Solve Complex Problems via Dataset Decomposition [53.1641602054716]
This research explores a reverse curriculum generation approach that decomposes complex datasets into simpler, more learnable components.
We propose a teacher-student framework where the teacher is equipped with the ability to reason step-by-step, which is used to generate easier versions of examples.
arXiv Detail & Related papers (2026-02-23T19:25:40Z) - Data Science and Technology Towards AGI Part I: Tiered Data Management [53.64581824953229]
We argue that the development of artificial intelligence is entering a new phase of data-model co-evolution.
We introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge.
We validate the effectiveness of the proposed framework through empirical studies.
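A minimal sketch of what such a tiered scheme could look like in code is given below; the abstract only specifies the endpoints (L0 raw, uncurated resources and L4 organized, verifiable knowledge), so the intermediate tier labels here are assumptions made purely for illustration.

```python
from enum import Enum

class DataTier(Enum):
    """Illustrative L0-L4 tiers; only L0 and L4 are described in the abstract,
    the intermediate labels are assumed for the sake of the example."""
    L0_RAW = 0          # raw, uncurated resources
    L1_CLEANED = 1      # assumed: deduplicated / format-normalized
    L2_ANNOTATED = 2    # assumed: labeled or schema-aligned
    L3_CURATED = 3      # assumed: quality-checked and documented
    L4_VERIFIED = 4     # organized and verifiable knowledge

def promote(record: dict, checks_passed: bool) -> dict:
    """Move a record up one tier only when its quality checks pass."""
    tier = DataTier(record["tier"])
    if checks_passed and tier != DataTier.L4_VERIFIED:
        record["tier"] = tier.value + 1
    return record

print(promote({"tier": 0, "payload": "..."}, checks_passed=True))
```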
arXiv Detail & Related papers (2026-02-09T18:47:51Z) - Compressive Meta-Learning [49.300635370079874]
Compressive learning is a framework that enables efficient processing by using random, non-linear features.
We propose a framework that meta-learns both the encoding and decoding stages of compressive learning methods.
We explore multiple applications, including neural network-based compressive PCA, compressive ridge regression, compressive k-means, and autoencoders.
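To make "random, non-linear features" concrete, the numpy sketch below compresses an entire toy dataset into one fixed-size vector of averaged random features, the basic compressive-learning step on which downstream learning would then operate; it is a generic illustration, not the meta-learned encoder/decoder proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))     # toy dataset: n samples, d features

# Random nonlinear feature map phi(x) = cos(W x + b); W and b are drawn once
# and never trained, which is what makes the encoding cheap.
m = 256                               # sketch size, independent of n
W = rng.normal(size=(X.shape[1], m))
b = rng.uniform(0, 2 * np.pi, size=m)

def sketch(X):
    """Summarize the whole dataset as the mean of its random features."""
    return np.cos(X @ W + b).mean(axis=0)

z = sketch(X)                         # downstream learning sees only z
print(z.shape)                        # (256,)
```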
arXiv Detail & Related papers (2025-08-14T22:08:06Z) - A collaborative digital twin built on FAIR data and compute infrastructure [41.94295877935867]
This work presents a distributed SDL implementation built on nanoHUB services for online simulation and FAIR data management.
Researchers and students can set up their own experiments, share data with collaborators, and explore the combination of FAIR data, predictive ML models, and sequential optimization.
arXiv Detail & Related papers (2025-06-24T18:13:52Z) - LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education [5.421088637597145]
Large language models (LLMs) offer a promising approach to create large-scale, privacy-preserving synthetic data.
This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o.
We compare the distribution of test case failures between synthetic and real student data from two courses to analyze the accuracy of the synthetic data in mimicking real student data.
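The comparison described above can be pictured with a small numpy sketch: compute per-test-case failure rates for real and synthetic submissions, then measure how far apart the two failure profiles are. The toy arrays and the total-variation metric are illustrative assumptions, not the paper's exact evaluation.

```python
import numpy as np

# Rows = submissions, columns = test cases; 1 means the test case failed.
# Made-up stand-ins for real student submissions and GPT-4o-generated ones.
real = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
synthetic = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 1], [0, 0, 1]])

real_rates = real.mean(axis=0)        # failure rate per test case (real)
synth_rates = synthetic.mean(axis=0)  # failure rate per test case (synthetic)

# Total variation distance between the two normalized failure profiles.
p = real_rates / real_rates.sum()
q = synth_rates / synth_rates.sum()
tv_distance = 0.5 * np.abs(p - q).sum()

print(real_rates, synth_rates, round(tv_distance, 3))
```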
arXiv Detail & Related papers (2024-11-01T00:24:59Z) - Integrating HCI Datasets in Project-Based Machine Learning Courses: A College-Level Review and Case Study [0.7499722271664147]
This study explores the integration of real-world machine learning (ML) projects using human-computer interfaces (HCI) datasets in college-level courses.
arXiv Detail & Related papers (2024-08-06T23:05:15Z) - AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We develop the tasks involved in dataset development and offer insights into their effective management.
Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
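As a small, generic illustration of the quality-evaluation step (not the chapter's specific methodology), the snippet below runs a few routine checks on a toy table: per-column missingness, exact duplicate rows, and implausible values.

```python
import pandas as pd

# Toy collected data; in practice this comes out of the collection and transformation steps.
df = pd.DataFrame({
    "age": [23, 35, None, 200, 35],
    "country": ["US", "FR", "FR", "US", "FR"],
})

report = {
    "missing_fraction": df.isna().mean().to_dict(),                        # per-column missingness
    "duplicate_rows": int(df.duplicated().sum()),                          # exact duplicate records
    "age_out_of_range": int((~df["age"].dropna().between(0, 120)).sum()),  # implausible ages
}
print(report)
```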
arXiv Detail & Related papers (2024-04-15T12:01:42Z) - Practical Vertical Federated Learning with Unsupervised Representation Learning [47.77625754666018]
Federated learning enables multiple parties to collaboratively train a machine learning model without sharing their raw data.
We propose a novel communication-efficient vertical federated learning algorithm named FedOnce, which requires only one-shot communication among parties.
Our privacy-preserving technique significantly outperforms the state-of-the-art approaches under the same privacy budget.
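A rough sketch of the one-shot vertical-federated-learning idea (not FedOnce's actual algorithm, and without its privacy protections): each passive party compresses its own feature columns with an unsupervised step, here plain PCA, and transmits the result a single time; the party holding the labels then trains an ordinary classifier on its own features plus the received representation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Vertically partitioned data: same samples, different feature columns per party.
X_active = rng.normal(size=(n, 5))    # party that holds the labels
X_passive = rng.normal(size=(n, 20))  # passive party, no labels
y = (X_active[:, 0] + X_passive[:, 0] > 0).astype(int)

def local_representation(X, k):
    """Unsupervised local step: project onto the top-k principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

# One-shot communication: the passive party sends its k-dim representation once.
# (A real system would add privacy protection before sending anything.)
sent_once = local_representation(X_passive, k=4)

features = np.hstack([X_active, sent_once])
model = LogisticRegression(max_iter=1000).fit(features, y)
print(round(model.score(features, y), 3))
```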
arXiv Detail & Related papers (2022-08-13T08:41:32Z) - CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
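A minimal sketch of using object-oriented design to categorize computational models, in the spirit described above; the class names and attributes are invented for illustration and are not CateCom's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ComputationalModel:
    """Base category: metadata every catalogued model carries."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

@dataclass
class PhysicsBasedModel(ComputationalModel):
    governing_equations: str = ""   # e.g. "Kohn-Sham DFT"

@dataclass
class DataDrivenModel(ComputationalModel):
    training_data: str = ""         # provenance of the data the model was fit on
    architecture: str = ""          # e.g. "random forest"

catalog = [
    PhysicsBasedModel("DFT total energy", ["structure"], ["energy"],
                      governing_equations="Kohn-Sham DFT"),
    DataDrivenModel("Band-gap regressor", ["composition"], ["band gap"],
                    training_data="experimental band gaps", architecture="random forest"),
]
for entry in catalog:
    print(type(entry).__name__, "|", entry.name)
```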
arXiv Detail & Related papers (2021-09-28T02:59:40Z) - Distributed Deep Learning in Open Collaborations [49.240611132653456]
We propose a novel algorithmic framework designed specifically for collaborative training.
We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost.
arXiv Detail & Related papers (2021-06-18T16:23:13Z) - From Distributed Machine Learning to Federated Learning: A Survey [49.7569746460225]
Federated learning emerges as an efficient approach to exploit distributed data and computing resources.
We propose a functional architecture of federated learning systems and a taxonomy of related techniques.
We discuss distributed training, data communication, and security in FL systems.
arXiv Detail & Related papers (2021-04-29T14:15:11Z) - Enabling collaborative data science development with the Ballet framework [9.424574945499844]
We present a novel conceptual framework and ML programming model to address challenges to scaling data science collaborations.
We instantiate these ideas in Ballet, a lightweight software framework for collaborative open-source data science.
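To illustrate the general shape of such a collaborative programming model (a generic sketch, not Ballet's actual API), each contributor submits a small, self-contained feature definition; the framework validates every submission and assembles the accepted ones into a single feature matrix.

```python
import numpy as np

# Each contribution maps the raw table (a dict of columns) to one feature column.
def income_log(raw):
    return np.log1p(np.asarray(raw["income"], dtype=float))

def rooms_per_person(raw):
    return np.asarray(raw["rooms"], dtype=float) / np.asarray(raw["household_size"], dtype=float)

contributed = {"income_log": income_log, "rooms_per_person": rooms_per_person}

def validate(feature_fn, raw, n_rows):
    """Accept a contribution only if it yields one finite value per row."""
    out = feature_fn(raw)
    return out.shape == (n_rows,) and bool(np.isfinite(out).all())

raw = {"income": [30_000, 52_000, 61_000],
       "rooms": [3, 5, 4],
       "household_size": [1, 2, 4]}

accepted = {name: fn for name, fn in contributed.items() if validate(fn, raw, n_rows=3)}
feature_matrix = np.column_stack([fn(raw) for fn in accepted.values()])
print(sorted(accepted), feature_matrix.shape)   # ['income_log', 'rooms_per_person'] (3, 2)
```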
arXiv Detail & Related papers (2020-12-14T18:51:23Z) - Computational Skills by Stealth in Secondary School Data Science [16.960800464621993]
We discuss a proposal for the stealth development of computational skills in students' first exposure to data science.
The intent of this approach is to support students, regardless of interest and self-efficacy in coding, in becoming data-driven learners.
arXiv Detail & Related papers (2020-10-08T09:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.