LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education
- URL: http://arxiv.org/abs/2411.10455v1
- Date: Fri, 01 Nov 2024 00:24:59 GMT
- Title: LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education
- Authors: Juho Leinonen, Paul Denny, Olli Kiljunen, Stephen MacNeil, Sami Sarsa, Arto Hellas
- Abstract summary: Large language models (LLMs) offer a promising approach to creating large-scale, privacy-preserving synthetic data.
This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o.
We compare the distribution of test case failures between synthetic and real student data from two courses to analyze the accuracy of the synthetic data in mimicking real student data.
- Abstract: There is a great need for data in computing education research. Data is needed to understand how students behave, to train models of student behavior that can optimally support students, and to develop and validate new assessment tools and learning analytics techniques. However, relatively few computing education datasets are shared openly, often due to privacy regulations and the difficulty of ensuring the data is properly anonymized. Large language models (LLMs) offer a promising approach to creating large-scale, privacy-preserving synthetic data, which can be used to explore various aspects of student learning, develop and test educational technologies, and support research in areas where collecting real student data may be challenging or impractical. This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o. We compare the distribution of test case failures between synthetic and real student data from two courses to assess how accurately the synthetic data mimics real student data. Our findings suggest that LLMs can generate synthetic incorrect submissions that are not significantly different from real student data with regard to test case failure distributions. Our research contributes to the development of reliable synthetic datasets for computing education research and teaching, potentially accelerating progress in the field while preserving student privacy.
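To make the pipeline described in the abstract concrete, below is a minimal sketch, assuming the OpenAI Python SDK and SciPy: GPT-4o is prompted to produce an incorrect student submission, each submission is run against the exercise's test suite, and the per-test-case failure counts of real and synthetic submissions are compared. The prompt wording, the helper names, and the choice of a chi-squared test of homogeneity are assumptions for illustration; the abstract does not specify these details.

```python
# Hypothetical sketch of the pipeline from the abstract: generate synthetic
# buggy submissions with GPT-4o, count per-test-case failures, and compare
# the failure distribution against real student data. Prompt text, helper
# names, and the chi-squared test are assumptions, not details from the paper.
from collections import Counter

import numpy as np
from openai import OpenAI                  # pip install openai
from scipy.stats import chi2_contingency   # pip install scipy

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_buggy_submission(exercise_statement: str) -> str:
    """Ask GPT-4o to produce a plausible but incorrect student solution."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": ("Role-play a student in an introductory programming "
                         "course. Write a plausible but incorrect solution to "
                         "the exercise. Reply with code only.")},
            {"role": "user", "content": exercise_statement},
        ],
    )
    return response.choices[0].message.content


def failure_counts(submissions: list[str], tests: list) -> list[int]:
    """Count how many submissions fail each test case.

    Each element of `tests` is a callable that takes a submission's source
    code and returns True if the submission passes that test case.
    """
    counts: Counter[int] = Counter()
    for code in submissions:
        for i, test in enumerate(tests):
            if not test(code):
                counts[i] += 1
    return [counts[i] for i in range(len(tests))]


def same_failure_distribution(real: list[int], synthetic: list[int],
                              alpha: float = 0.05) -> bool:
    """Chi-squared test of homogeneity on the two failure-count vectors.

    Returns True when no significant difference is detected between the real
    and synthetic failure distributions at the given significance level.
    """
    table = np.array([real, synthetic])
    # Drop test cases that no submission fails; all-zero columns break the test.
    table = table[:, table.sum(axis=0) > 0]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value > alpha
```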
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic datasets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Customizing Synthetic Data for Data-Free Student Learning [6.8080936803807734]
Data-free knowledge distillation (DFKD) aims to obtain a lightweight student model without the original training data.
To train the student model more effectively, synthetic data should be customized to the student's current learning ability.
We propose Customizing Synthetic Data for Data-Free Student Learning (CSD) in this paper.
arXiv Detail & Related papers (2023-07-10T13:17:29Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Synthetic Data in Healthcare [10.555189948915492]
We present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine.
We discuss how synthetic data can promote privacy, equity, safety, and continual and causal learning, while also risking the introduction of flaws and blind spots and the propagation or exaggeration of biases.
arXiv Detail & Related papers (2023-04-06T17:23:39Z)
- Machine Learning for Synthetic Data Generation: A Review [23.073056971997715]
This paper reviews existing studies that employ machine learning models to generate synthetic data.
The review encompasses various perspectives, starting with the applications of synthetic data generation and spanning the computer vision, speech, natural language processing, healthcare, and business domains.
The paper also addresses crucial privacy and fairness concerns related to synthetic data generation.
arXiv Detail & Related papers (2023-02-08T13:59:31Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data.
Recent results from general Bayesian updating support a novel and robust approach to synthetic learning founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related-domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.