Further Evidence on a Controversial Topic about Human-Based Experiments: Professionals vs. Students
- URL: http://arxiv.org/abs/2506.11597v1
- Date: Fri, 13 Jun 2025 09:05:36 GMT
- Title: Further Evidence on a Controversial Topic about Human-Based Experiments: Professionals vs. Students
- Authors: Simone Romano, Francesco Paolo Sferratore, Giuseppe Scanniello
- Abstract summary: We compare 62 students and 42 software professionals on a bug-fixing task on the same Java program. Considering the differences between the two groups of participants, the gathered data show that the students outperformed the professionals in fixing bugs.
- Score: 3.358019319437577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most Software Engineering (SE) human-based controlled experiments rely on students as participants, raising concerns about their external validity. Specifically, the realism of results obtained from students and their applicability to the software industry remain in question. In this short paper, we bring further evidence on this controversial point. To do so, we compare 62 students and 42 software professionals on a bug-fixing task on the same Java program. The students were enrolled in a Bachelor's program in Computer Science, while the professionals were employed by two multinational companies (for one of them, the professionals were from two offices). There were some variations in the experimental settings of the two groups (students and professionals). For instance, the experimental environment of the experiment with professionals was more realistic; i.e., they faced stress factors such as interruptions during the bug-fixing task. Considering the differences between the two groups of participants, the gathered data show that the students outperformed the professionals in fixing bugs. This diverges to some extent from past empirical evidence. Rather than presenting definitive conclusions, our results aim to catalyze the discussion on the use of students in experiments and pave the way for future investigations. Specifically, our results encourage examining the complex factors influencing SE tasks and making experiments as realistic as possible.
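As a rough illustration of how such a two-group comparison is often analyzed in SE experiments, the sketch below applies a rank-based test and an effect size to hypothetical per-participant scores. The data, the Mann-Whitney U test, and the Vargha-Delaney A12 effect size are illustrative assumptions, not the paper's actual analysis.

```python
# Illustrative sketch: comparing two independent groups on a bug-fixing
# measure, as in a student-vs-professional experiment. The scores below
# are made-up placeholders, not data from the paper.
from scipy.stats import mannwhitneyu

students = [0.80, 0.60, 0.90, 0.70, 0.75, 0.85]       # hypothetical fix rates
professionals = [0.50, 0.65, 0.55, 0.60, 0.70, 0.45]  # hypothetical fix rates

# Mann-Whitney U: a rank-based test that does not assume normality,
# a common choice for small-sample SE experiments.
stat, p_value = mannwhitneyu(students, professionals, alternative="two-sided")

# A simple effect-size companion: Vargha-Delaney A12, the probability
# that a randomly chosen student outscores a randomly chosen professional
# (ties count as 0.5).
n_pairs = len(students) * len(professionals)
a12 = sum((s > q) + 0.5 * (s == q) for s in students for q in professionals) / n_pairs

print(f"U={stat:.1f}, p={p_value:.3f}, A12={a12:.2f}")
```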
Related papers
- MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [128.2992631982687]
We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones.
We propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis.
We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator.
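To make the simulator idea concrete, here is a toy sketch in which a hypothesis's simulated performance is a noisy function of its similarity to a known ground-truth hypothesis. The Jaccard token similarity and Gaussian noise model are illustrative assumptions, not the domain-informed simulator proposed in the paper.

```python
# Toy sketch of experiment-guided hypothesis ranking: simulated "performance"
# is similarity to a known ground-truth hypothesis plus noise.
import random

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two hypothesis statements."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def simulate_feedback(hypothesis: str, ground_truth: str, noise: float = 0.05) -> float:
    """Simulated experimental outcome: similarity plus Gaussian noise."""
    return jaccard(hypothesis, ground_truth) + random.gauss(0.0, noise)

ground_truth = "catalyst increases yield at low temperature"
candidates = [
    "catalyst increases yield at high temperature",
    "solvent polarity increases yield",
    "catalyst increases yield at low temperature in ethanol",
]

# Rank candidates by their simulated feedback, best first.
ranked = sorted(candidates, key=lambda h: simulate_feedback(h, ground_truth), reverse=True)
for h in ranked:
    print(h)
```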
arXiv Detail & Related papers (2025-05-23T13:24:50Z)
- On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations [53.0667196725616]
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment.
DRL has recently gained traction for being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games.
Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist.
arXiv Detail & Related papers (2025-03-28T16:25:06Z)
- Debiasing Architectural Decision-Making: An Experiment With Students and Practitioners [2.9767565026354195]
The goal of this study was to design and evaluate a debiasing workshop with individuals at various stages of their professional careers.
We found that the workshop had a more substantial impact on practitioners.
We assume that the practitioners' attachment to their systems may be the cause of their susceptibility to biases.
arXiv Detail & Related papers (2025-02-06T12:12:53Z)
- On (Mis)perceptions of testing effectiveness: an empirical study [1.8026347864255505]
This research aims to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience.
In the original study, we conduct a controlled experiment with students applying two testing techniques and a code review technique.
At the end of the experiment, they take a survey to find out which technique they perceive to be most effective.
The results of the replicated study confirm the findings of the original study and suggest that participants' perceptions might be based not on their opinions about complexity or preferences for techniques but on how well they think that they have applied the techniques.
arXiv Detail & Related papers (2024-02-11T14:50:01Z)
- Overwhelmed software developers: An Interpretative Phenomenological Analysis [43.18707677931078]
We interviewed two software developers who have experienced overwhelm recently.
We uncover seven categories of overwhelm (communication, disturbance, organizational, variety, technical, temporal, and positive overwhelm).
Participants reported that overwhelm can sometimes be experienced as positive and pleasant, and that it can increase their mental focus, self-ambition, and productivity.
arXiv Detail & Related papers (2024-01-05T12:39:08Z)
- PyExperimenter: Easily distribute experiments and track results [63.871474825689134]
PyExperimenter is a tool to facilitate the setup, documentation, execution, and subsequent evaluation of results from an empirical study of algorithms.
It is intended to be used by researchers in the field of artificial intelligence, but is not limited to that field.
arXiv Detail & Related papers (2023-01-16T10:43:02Z)
- Fair Effect Attribution in Parallel Online Experiments [57.13281584606437]
A/B tests serve the purpose of reliably identifying the effect of changes introduced in online services.
It is common for online platforms to run a large number of simultaneous experiments by splitting incoming user traffic randomly.
Despite perfect randomization between groups, simultaneous experiments can interact with each other and negatively impact average population outcomes.
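The randomized traffic splitting such platforms rely on is commonly implemented with salted hashing, sketched below. The salt scheme, the 50/50 split, and the function names are illustrative assumptions; the paper's effect-attribution method itself is not shown here.

```python
# Minimal sketch of independent per-experiment traffic splitting via salted
# hashing, the kind of parallel randomization the paper studies.
import hashlib

def assign(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to a variant of one experiment.

    Salting the hash with the experiment name makes assignments across
    simultaneous experiments (approximately) independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# One user can land in different arms of different concurrent experiments,
# which is exactly the setting where cross-experiment interactions can arise.
for exp in ("new_ranker", "new_checkout"):
    print(exp, assign("user-42", exp))
```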
arXiv Detail & Related papers (2022-10-15T17:15:51Z)
- Monitor++?: Multiple versus Single Laboratory Monitors in Early Programming Education [4.797216015572358]
This paper presents an empirical study of an introductory-level programming course with students using multiple monitors.
It compares their performance and self-reported experiences versus students using a single monitor.
arXiv Detail & Related papers (2021-08-13T14:56:04Z)
- Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping [62.78338049381917]
Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing.
We experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds.
We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials.
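One way to quantify best-found performance as a function of the number of trials is to bootstrap the expected maximum over N randomly sampled runs, sketched below with made-up per-seed scores. This is an illustrative reconstruction of the general technique, not the paper's exact procedure.

```python
# Sketch of quantifying best-found validation performance as a function of
# the number of fine-tuning trials: bootstrap the expected max over N runs.
# The per-seed scores below are placeholders, not the paper's results.
import random

# Hypothetical validation accuracies from fine-tuning runs that differ
# only in random seed (weight initialization + data order).
seed_scores = [0.88, 0.91, 0.86, 0.90, 0.84, 0.92, 0.89, 0.87, 0.90, 0.85]

def expected_best(scores, n_trials, n_boot=10_000, rng=random.Random(0)):
    """Bootstrap estimate of E[max score over n_trials randomly chosen runs]."""
    return sum(
        max(rng.choices(scores, k=n_trials)) for _ in range(n_boot)
    ) / n_boot

for n in (1, 3, 5, 10):
    print(f"expected best of {n:2d} trials: {expected_best(seed_scores, n):.3f}")
```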
arXiv Detail & Related papers (2020-02-15T02:40:10Z)
- Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries.
This paper introduces a reinforcement learning framework for carrying out A/B testing in online experiments.
arXiv Detail & Related papers (2020-02-05T10:25:02Z)
- Online Peer-Assessment Datasets [0.0]
Peer-assessment experiments were conducted among first and second year students at the University of Trento.
The experiments spanned an entire semester and were conducted in five computer science courses between 2013 and 2016.
The datasets are reported as parsable data structures that, with intermediate processing, can be moulded into NLP or ML-ready datasets.
arXiv Detail & Related papers (2019-12-30T18:48:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.