In-class Data Analysis Replications: Teaching Students while Testing Science
- URL: http://arxiv.org/abs/2308.16491v2
- Date: Tue, 30 Jul 2024 22:09:02 GMT
- Title: In-class Data Analysis Replications: Teaching Students while Testing Science
- Authors: Kristina Gligoric, Tiziano Piccardi, Jake Hofman, Robert West
- Abstract summary: In the present study, we incorporated data analysis replications in the project component of the Applied Data Analysis course taught at EPFL.
We find discrepancies between what students expect of data analysis replications and what they experience.
We identify tangible benefits of the in-class data analysis replications for scientific communities.
- Score: 16.951059542542843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Science is facing a reproducibility crisis. Previous work has proposed incorporating data analysis replications into classrooms as a potential solution. However, despite the potential benefits, it is unclear whether this approach is feasible, and if so, what the involved stakeholders (students, educators, and scientists) should expect from it. Can students perform a data analysis replication over the course of a class? What are the costs and benefits for educators? And how can this solution help benchmark and improve the state of science? In the present study, we incorporated data analysis replications in the project component of the Applied Data Analysis course (CS-401) taught at EPFL (N=354 students). Here we report pre-registered findings based on surveys administered throughout the course. First, we demonstrate that students can replicate previously published scientific papers, most of them qualitatively and some exactly. We find discrepancies between what students expect of data analysis replications and what they experience by doing them, along with changes in expectations about reproducibility, which together serve as evidence of attitude shifts to foster students' critical thinking. Second, we provide information for educators about how much overhead is needed to incorporate replications into the classroom and identify concerns that replications bring as compared to more traditional assignments. Third, we identify tangible benefits of the in-class data analysis replications for scientific communities, such as a collection of replication reports and insights about replication barriers in scientific work that should be avoided going forward. Overall, we demonstrate that incorporating replication tasks into a large data science class can increase the reproducibility of scientific work as a by-product of data science instruction, thus benefiting both science and students.
Related papers
- Hypothesizing Missing Causal Variables with LLMs [55.28678224020973]
We formulate a novel task where the input is a partial causal graph with missing variables, and the output is a hypothesis about the missing variables to complete the partial graph.
We show the strong ability of LLMs to hypothesize the mediation variables between a cause and its effect.
We also observe surprising results where some of the open-source models outperform the closed GPT-4 model.
arXiv Detail & Related papers (2024-09-04T10:37:44Z)
- Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations.
We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate.
Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z)
- Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study [61.74571814707054]
We evaluate whether every generated sentence is grounded in retrieved documents or the model's pre-training data.
Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded.
Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations.
arXiv Detail & Related papers (2024-04-10T14:50:10Z)
- Reproducibility and Geometric Intrinsic Dimensionality: An Investigation on Graph Neural Network Research [0.0]
Building on these efforts we turn towards another critical challenge in machine learning, namely the curse of dimensionality.
Using the closely linked concept of intrinsic dimension, we investigate the extent to which the machine learning models used are influenced by the intrinsic dimension of the data sets they are trained on.
arXiv Detail & Related papers (2024-03-13T11:44:30Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective [91.14291142262262]
This work presents a straightforward and fundamental explanation from the data perspective.
Our preliminary investigation reveals a strong correlation between the degeneration issue and the presence of repetitions in training data.
Our experiments reveal that penalizing the repetitions in training data remains critical even when considering larger model sizes and instruction tuning.
arXiv Detail & Related papers (2023-10-16T09:35:42Z)
- Large Language Models for Automated Open-domain Scientific Hypotheses Discovery [50.40483334131271]
This work proposes the first dataset for social science academic hypotheses discovery.
Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity.
A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance.
arXiv Detail & Related papers (2023-09-06T05:19:41Z)
- The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning [17.336655978572583]
Recent concerns that machine learning (ML) may be facing a misdiagnosis and replication crisis suggest that some published claims in ML research cannot be taken at face value.
A deeper understanding of what concerns in research in supervised ML have in common with the replication crisis in experimental science can put the new concerns in perspective.
arXiv Detail & Related papers (2022-03-12T18:26:24Z)
- Opinionated practices for teaching reproducibility: motivation, guided instruction and practice [0.0]
Predictive modelling is often one of the most interesting topics to novices in data science.
By comparison, students are not as intrinsically motivated to learn reproducibility, and it is not an easy topic for them to learn.
Providing extra motivation, guided instruction and lots of practice are key to effectively teaching this topic.
arXiv Detail & Related papers (2021-09-17T19:15:41Z)
- An Analytical Theory of Curriculum Learning in Teacher-Student Networks [10.303947049948107]
In humans and animals, curriculum learning is critical to rapid learning and effective pedagogy.
In machine learning, curricula are not widely used and empirically often yield only moderate benefits.
arXiv Detail & Related papers (2021-06-15T11:48:52Z)
- Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design [0.8594140167290099]
We identify three trends within the fields of chemical reaction prediction and synthesis design that require a change in direction.
First, the manner in which reaction datasets are split into reactants and reagents encourages testing models in an unrealistically generous manner.
Second, we highlight the prevalence of mislabelled data, and suggest that the focus should be on outlier removal rather than data fitting only.
arXiv Detail & Related papers (2021-05-06T13:11:56Z)