Synthetic Embedding-based Data Generation Methods for Student
Performance
- URL: http://arxiv.org/abs/2101.00728v1
- Date: Sun, 3 Jan 2021 23:43:36 GMT
- Title: Synthetic Embedding-based Data Generation Methods for Student
Performance
- Authors: Dom Huh
- Abstract summary: We introduce a general framework for synthetic embedding-based data generation (SEDG)
SEDG is a search-based approach to generating new synthetic samples, using embeddings to optimally correct the detrimental effects of class imbalance.
We find SEDG to outperform the traditional re-sampling methods for deep neural networks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the inherent class imbalance issue within student performance datasets,
samples belonging to the edges of the target class distribution pose a
challenge for predictive machine learning algorithms to learn. In this paper,
we introduce a general framework for synthetic embedding-based data generation
(SEDG), a search-based approach to generate new synthetic samples using
embeddings to optimally correct the detrimental effects of class imbalance. We
compare the SEDG framework to past synthetic data generation methods, including
deep generative models and traditional sampling methods. In our results, we
find SEDG to outperform the traditional re-sampling methods for deep neural
networks and perform competitively for common machine learning classifiers on
the student performance task in several standard performance metrics.
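The abstract does not spell out the SEDG search procedure, so the following is only a minimal numpy sketch of the underlying idea: synthesizing new minority-class points by interpolating between existing samples in an embedding space. The function name and interpolation scheme are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def generate_embedding_samples(minority_embeddings, n_new, rng=None):
    """Hypothetical sketch: create synthetic minority-class points by
    interpolating between random pairs of existing embeddings."""
    rng = np.random.default_rng(rng)
    n = len(minority_embeddings)
    # Pick random pairs of existing minority samples.
    i = rng.integers(0, n, size=n_new)
    j = rng.integers(0, n, size=n_new)
    # Place each new point at a random spot along the segment between them.
    lam = rng.random((n_new, 1))
    return minority_embeddings[i] + lam * (minority_embeddings[j] - minority_embeddings[i])

# Example: 20 minority-class samples in an 8-dimensional embedding space.
emb = np.random.default_rng(0).normal(size=(20, 8))
synthetic = generate_embedding_samples(emb, n_new=50, rng=1)
print(synthetic.shape)  # (50, 8)
```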
Related papers
- Deep Learning Meets Oversampling: A Learning Framework to Handle Imbalanced Classification [0.0]
We propose a novel learning framework that can generate synthetic data instances in a data-driven manner.
The proposed framework formulates the oversampling process as a composition of discrete decision criteria.
Experiments on the imbalanced classification task demonstrate the superiority of our framework over state-of-the-art algorithms.
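The learned decision criteria themselves are not described in this summary; a toy sketch that composes the same kind of discrete choices (seed instance, partner instance, discretized interpolation weight), here made at random rather than learned, might look like:

```python
import numpy as np

def oversample_by_discrete_decisions(X_min, n_new, n_lambda_bins=10, seed=0):
    """Toy oversampler: each synthetic sample is the result of three
    discrete decisions. The real framework learns these; here they
    are drawn at random purely for illustration."""
    rng = np.random.default_rng(seed)
    lam_grid = np.linspace(0.0, 1.0, n_lambda_bins)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))      # decision 1: seed instance
        j = rng.integers(len(X_min))      # decision 2: partner instance
        lam = rng.choice(lam_grid)        # decision 3: discretized weight
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.stack(out)
```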
arXiv Detail & Related papers (2025-02-08T13:35:00Z)
- Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches [35.431340001608476]
This paper presents an innovative approach to enhancing few-shot learning by integrating data augmentation with model fine-tuning.
It aims to tackle the challenges posed by small-sample data in fields such as drug discovery, target recognition, and malicious traffic detection.
Results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning.
arXiv Detail & Related papers (2024-11-25T16:51:11Z)
- FuseGen: PLM Fusion for Data-generation based Zero-shot Learning [18.51772808242954]
FuseGen is a novel data generation-based zero-shot learning framework.
It introduces a new criterion for subset selection from synthetic datasets.
The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
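The actual FuseGen selection criterion is not given in this summary; as a hypothetical stand-in, one simple subset-selection rule is to keep only synthetic samples whose labels several PLMs reproduce:

```python
def select_by_agreement(samples, plm_labelers, min_agree=2):
    """Hypothetical proxy criterion: keep synthetic (text, label) pairs
    whose label is reproduced by at least `min_agree` of the PLM
    labelers (each labeler is a callable text -> label)."""
    kept = []
    for text, label in samples:
        votes = sum(1 for plm in plm_labelers if plm(text) == label)
        if votes >= min_agree:
            kept.append((text, label))
    return kept
```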
arXiv Detail & Related papers (2024-06-18T11:55:05Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
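The summary does not detail the unlearning objective; a generic gradient-ascent unlearning step on a batch of flagged low-quality synthetic examples, sketched in PyTorch under that assumption, could look like:

```python
import torch

def unlearning_step(model, loss_fn, flagged_batch, optimizer, ascent_scale=1.0):
    """One generic 'unlearning' update: ascend the loss on flagged
    (low-quality synthetic) examples so the model moves away from them.
    This is a common unlearning recipe, not the paper's exact method."""
    inputs, targets = flagged_batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # Negate the loss so optimizer.step() performs gradient *ascent*.
    (-ascent_scale * loss).backward()
    optimizer.step()
    return loss.item()
```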
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Downstream Task-Oriented Generative Model Selections on Synthetic Data Training for Fraud Detection Models [9.754400681589845]
In this paper, we approach the downstream task-oriented generative model selection problem in the case of training fraud detection models.
Our investigation supports that, while both Neural Network (NN)-based and Bayesian Network (BN)-based generative models complete the synthetic training task well under a loose model-interpretability constraint, BN-based generative models outperform NN-based ones when training fraud detection models under a strict model-interpretability constraint.
arXiv Detail & Related papers (2024-01-01T23:33:56Z)
- Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters [69.24377241408851]
Overfitting to the source domain is a common issue in gradient-based training of deep neural networks.
We propose to base the selection on the gradient signal-to-noise ratio (GSNR) of the network's parameters.
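Following the usual definition, the GSNR of a parameter is the squared mean of its per-sample gradients divided by their variance; a small numpy sketch (per-sample gradients assumed to be precomputed):

```python
import numpy as np

def gsnr(per_sample_grads, eps=1e-12):
    """Gradient signal-to-noise ratio per parameter.

    per_sample_grads: array of shape (n_samples, n_params) holding the
    gradient of the per-sample loss w.r.t. each parameter.
    GSNR = mean(g)^2 / var(g), taken over samples for each parameter.
    """
    mean = per_sample_grads.mean(axis=0)
    var = per_sample_grads.var(axis=0)
    return mean ** 2 / (var + eps)

# Dummy per-sample gradients for 10 parameters over 128 samples.
grads = np.random.default_rng(0).normal(loc=0.5, scale=1.0, size=(128, 10))
print(gsnr(grads))  # higher values -> more consistent gradient direction
```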
arXiv Detail & Related papers (2023-10-11T10:21:34Z)
- Evaluating the Utility of GAN Generated Synthetic Tabular Data for Class Balancing and Low Resource Settings [0.0]
The study employed the Generalised Linear Model (GLM) algorithm for class balancing experiments.
In low-resource experiments, models trained on data augmented with GAN-synthesized data exhibited better recall values than models trained on the original data alone.
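As a minimal sketch of this setup (the GAN itself omitted, synthesized minority rows assumed given), class balancing before fitting a GLM such as logistic regression might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_glm_with_synthetic(X_real, y_real, X_syn_minority, minority_label=1):
    """Fit a GLM (logistic regression) on real data augmented with
    GAN-synthesized minority-class rows. X_syn_minority is assumed to
    come from a generator trained elsewhere."""
    X = np.vstack([X_real, X_syn_minority])
    y = np.concatenate([y_real, np.full(len(X_syn_minority), minority_label)])
    return LogisticRegression(max_iter=1000).fit(X, y)
```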
arXiv Detail & Related papers (2023-06-24T10:27:08Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
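The bilevel procedure is not reproduced here; a minimal PyTorch illustration of a reweighted least-squares score with learnable per-sample weights, one plausible reading of the idea, is:

```python
import torch

def reweighted_score(X, W_adj, sample_logits):
    """Illustrative reweighted score for a linear SEM X ~ X @ W_adj.
    Per-sample weights come from a softmax over learnable logits, so
    they can be adapted by gradient descent alongside W_adj."""
    weights = torch.softmax(sample_logits, dim=0) * len(X)  # mean weight ~ 1
    residuals = ((X - X @ W_adj) ** 2).sum(dim=1)           # per-sample error
    return (weights * residuals).mean()
```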
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
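AutoSMOTE itself learns these decisions with hierarchical RL; the underlying base technique, classic SMOTE interpolation between a minority point and one of its k nearest minority neighbors, can be sketched as:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Classic SMOTE: each new minority sample lies on the segment
    between a minority point and one of its k nearest minority
    neighbors. Requires len(X_min) > k."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # one of the k true neighbors
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.stack(out)
```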
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
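CAFE's full objective is richer than this, but a minimal sketch of aligning real and synthetic features across several scales (here, by matching layer-wise batch means) is:

```python
import torch

def feature_alignment_loss(real_feats, syn_feats):
    """Illustrative alignment term: penalize the gap between the mean
    activations of real and synthetic batches at each scale.
    real_feats / syn_feats: lists of (batch, dim) tensors, one per layer."""
    loss = torch.tensor(0.0)
    for fr, fs in zip(real_feats, syn_feats):
        loss = loss + ((fr.mean(dim=0) - fs.mean(dim=0)) ** 2).sum()
    return loss
```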
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function [106.69643619725652]
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results.
We report state-of-the-art results for the text classification task on several benchmark datasets.
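The paper's mixed objective function is not reproduced in this summary; a plain BiLSTM text classifier trained with cross-entropy, the baseline the summary mentions, can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Minimal bidirectional LSTM text classifier (cross-entropy baseline)."""
    def __init__(self, vocab_size, embed_dim=128, hidden=256, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.head(h.mean(dim=1))    # mean-pool over time steps

model = BiLSTMClassifier(vocab_size=10_000)
logits = model(torch.randint(0, 10_000, (4, 32)))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))
```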
arXiv Detail & Related papers (2020-09-08T21:55:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.