Too Fine or Too Coarse? The Goldilocks Composition of Data Complexity
for Robust Left-Right Eye-Tracking Classifiers
- URL: http://arxiv.org/abs/2209.03761v1
- Date: Wed, 24 Aug 2022 23:18:08 GMT
- Title: Too Fine or Too Coarse? The Goldilocks Composition of Data Complexity
for Robust Left-Right Eye-Tracking Classifiers
- Authors: Brian Xiang and Abdelrahman Abdelmonsef
- Abstract summary: We train machine learning models utilizing a mixed dataset composed of both fine- and coarse-grain data.
For our purposes, finer-grain data refers to data collected using more complex methods whereas coarser-grain data refers to data collected using more simple methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The differences in distributional patterns between benchmark data and
real-world data have been one of the main challenges of using
electroencephalogram (EEG) signals for eye-tracking (ET) classification.
Therefore, increasing the robustness of machine learning models in predicting
eye-tracking positions from EEG data is integral for both research and consumer
use. Previously, we compared the performance of classifiers trained solely on
finer-grain data to those trained solely on coarse-grain. Results indicated
that despite the overall improvement in robustness, the performance of the
fine-grain trained models decreased, compared to coarse-grain trained models,
when the testing and training set contained the same distributional patterns
\cite{vectorbased}. This paper aims to address this case by training models
using datasets of mixed data complexity to determine the ideal distribution of
fine- and coarse-grain data. We train machine learning models utilizing a mixed
dataset composed of both fine- and coarse-grain data and then compare the
accuracies to models trained using solely fine- or coarse-grain data. For our
purposes, finer-grain data refers to data collected using more complex methods
whereas coarser-grain data refers to data collected using more simple methods.
We apply covariate distributional shifts to test for the susceptibility of each
training set. Our results indicated that the optimal training dataset for
EEG-ET classification is not composed of solely fine- or coarse-grain data, but
rather a mix of the two, leaning towards finer-grain.
Related papers
- Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification [7.357494019212501]
We propose efficient weighted-loss approaches to align synthetic data with real-world distribution.
We empirically assessed the effectiveness of our method on multiple text classification tasks.
arXiv Detail & Related papers (2024-10-28T20:53:49Z) - Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification [11.6055501181235]
We investigate the use of verification on synthesized data to prevent model collapse.
We show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse.
arXiv Detail & Related papers (2024-06-11T17:46:16Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimize the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Deep learning for missing data imputation has obstacles despite these encouraging results, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - SynBench: Task-Agnostic Benchmarking of Pretrained Representations using
Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models, that are pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, textitSynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Vector-Based Data Improves Left-Right Eye-Tracking Classifier
Performance After a Covariate Distributional Shift [0.0]
We propose a fine-grain data approach for EEG-ET data collection in order to create more robust benchmarking.
We train machine learning models utilizing both coarse-grain and fine-grain data and compare their accuracies when tested on data of similar/different distributional patterns.
Results showed that models trained on fine-grain, vector-based data were less susceptible to distributional shifts than models trained on coarse-grain, binary-classified data.
arXiv Detail & Related papers (2022-07-31T16:27:50Z) - Generating Data to Mitigate Spurious Correlations in Natural Language
Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on.
We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model.
Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
arXiv Detail & Related papers (2022-03-24T09:08:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.