Analysis of Knowledge Tracing performance on synthesised student data
- URL: http://arxiv.org/abs/2401.16832v1
- Date: Tue, 30 Jan 2024 09:19:50 GMT
- Title: Analysis of Knowledge Tracing performance on synthesised student data
- Authors: Panagiotis Pagonis and Kai Hartung and Di Wu and Munir Georges and
S\"oren Gr\"ottrup
- Abstract summary: Knowledge Tracing aims to predict the future performance of students by tracking the development of their knowledge states.
Despite all the recent progress made in this field, the application of KT models in education systems is still restricted from the data perspectives.
Our work shows that using only synthetic data for training can lead to similar performance as real data.
- Score: 3.9227982854973438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Tracing (KT) aims to predict the future performance of students by
tracking the development of their knowledge states. Despite all the recent
progress made in this field, the application of KT models in education systems
is still restricted from the data perspectives: 1) limited access to real life
data due to data protection concerns, 2) lack of diversity in public datasets,
3) noises in benchmark datasets such as duplicate records. To resolve these
problems, we simulated student data with three statistical strategies based on
public datasets and tested their performance on two KT baselines. While we
observe only minor performance improvement with additional synthetic data, our
work shows that using only synthetic data for training can lead to similar
performance as real data.
Related papers
- LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education [5.421088637597145]
Large language models (LLMs) offer a promising approach to create large-scale, privacy-preserving synthetic data.
This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o.
We compare the distribution of test case failures between synthetic and real student data from two courses to analyze the accuracy of the synthetic data in mimicking real student data.
arXiv Detail & Related papers (2024-11-01T00:24:59Z) - How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples.
We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics.
When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z) - How Knowledge Distillation Mitigates the Synthetic Gap in Fair Face Recognition [0.0]
Given a pretrained Teacher model trained on a real dataset, we show that carefully utilising synthetic datasets can yield surprising results.
Using Knowledge Distillation (KD) leads to performance gains across all ethnicities and decreased bias.
This approach addresses the limitations of synthetic data training, improving both the accuracy and fairness of face recognition models.
arXiv Detail & Related papers (2024-08-30T16:35:28Z) - FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution, by consolidating collaborative training across multiple data owners.
FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks.
We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z) - Assessment of Differentially Private Synthetic Data for Utility and
Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - CTP: Towards Vision-Language Continual Pretraining via Compatible
Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D.
The data from each industry as an independent task supports continual learning and conforms to the real-world long-tail nature to simulate pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z) - Transitioning from Real to Synthetic data: Quantifying the bias in model [1.6134566438137665]
This study aims to establish a trade-off between bias and fairness in the models trained using synthetic data.
We demonstrate there exist a varying levels of bias impact on models trained using synthetic data.
arXiv Detail & Related papers (2021-05-10T06:57:14Z) - Benchmarking the Benchmark -- Analysis of Synthetic NIDS Datasets [4.125187280299247]
We analyse the statistical properties of benign traffic in three of the more recent and relevant NIDS datasets.
Our results show a distinct difference of most of the considered statistical features between the synthetic datasets and two real-world datasets.
arXiv Detail & Related papers (2021-04-19T03:17:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.