Data Augmentation to Improve Large Language Models in Food Hazard and Product Detection
- URL: http://arxiv.org/abs/2502.08687v1
- Date: Wed, 12 Feb 2025 12:14:35 GMT
- Title: Data Augmentation to Improve Large Language Models in Food Hazard and Product Detection
- Authors: Areeg Fahad Rasheed, M. Zarkoosh, Shimam Amer Chasib, Safa F. Abbas
- Abstract summary: The primary objective of this study is to demonstrate the impact of data augmentation using ChatGPT-4o-mini on food hazard and product analysis.
The augmented data is generated using ChatGPT-4o-mini and subsequently used to train two large language models: RoBERTa-base and Flan-T5-base.
The results indicate that using augmented data helped improve model performance across key metrics, including recall, F1 score, precision, and accuracy.
- Abstract: The primary objective of this study is to demonstrate the impact of data augmentation using ChatGPT-4o-mini on food hazard and product analysis. The augmented data is generated using ChatGPT-4o-mini and subsequently used to train two large language models: RoBERTa-base and Flan-T5-base. The models are evaluated on test sets. The results indicate that using augmented data helped improve model performance across key metrics, including recall, F1 score, precision, and accuracy, compared to using only the provided dataset. The full code, including model training and the augmented dataset, can be found in this repository: https://github.com/AREEG94FAHAD/food-hazard-prdouct-cls
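The repository above holds the authors' full code; purely as an illustration of the pipeline the abstract describes, the Python sketch below pairs gpt-4o-mini paraphrase-style augmentation with fine-tuning of roberta-base. The prompt wording, the `load_original_data` helper, and `NUM_CLASSES` are assumptions, not the authors' settings.
```python
# Illustrative sketch only -- see the linked repository for the authors' code.
from openai import OpenAI
from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)
import datasets

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment(text: str, n: int = 3) -> list[str]:
    """Ask gpt-4o-mini for n rewrites of a food-incident report (assumed prompt)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Rewrite this food incident report {n} times, "
                              f"one rewrite per line:\n{text}"}],
    )
    return resp.choices[0].message.content.splitlines()[:n]

# Original + augmented examples, each paired with its hazard/product label.
texts, labels = load_original_data()          # hypothetical loader
aug = [(a, y) for t, y in zip(texts, labels) for a in augment(t)]
all_texts = list(zip(texts, labels)) + aug

tok = AutoTokenizer.from_pretrained("roberta-base")
enc = datasets.Dataset.from_dict({
    "text": [t for t, _ in all_texts],
    "label": [y for _, y in all_texts],       # integer class ids assumed
}).map(lambda b: tok(b["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_CLASSES)   # NUM_CLASSES: task-dependent
Trainer(model=model,
        args=TrainingArguments("out", num_train_epochs=3),
        train_dataset=enc).train()
```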
Related papers
- Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training [1.3980986259786223]
We show that dataset diversity can impact the performance of vision models.
Our study shows positive correlations between test set accuracy and data diversity.
These findings support our hypothesis and point to a promising direction for deeper exploration of how formal measures of data diversity influence model performance.
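As a hedged sketch of the kind of analysis this summary suggests, one could correlate a simple diversity score with test accuracy across training subsets; the metric below (mean pairwise embedding distance) is an assumption, not necessarily the paper's formal diversity measure.
```python
# Sketch: correlate a simple data-diversity score with test accuracy.
import numpy as np
from scipy.stats import pearsonr

def diversity(features: np.ndarray) -> float:
    """Mean pairwise Euclidean distance between example embeddings."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    return d[np.triu_indices(len(features), k=1)].mean()

# One (diversity, accuracy) pair per training subset, then a correlation test.
scores = [diversity(f) for f in subset_features]   # subset_features: hypothetical
r, p = pearsonr(scores, subset_accuracies)         # subset_accuracies: hypothetical
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```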
arXiv Detail & Related papers (2025-01-15T00:56:59Z)
- Data Diet: Can Trimming PET/CT Datasets Enhance Lesion Segmentation? [70.38903555729081]
We describe our approach to compete in the autoPET3 datacentric track.
We find that in the autoPETIII dataset, a model that is trained on the entire dataset exhibits undesirable characteristics.
We counteract this by removing the easiest samples from the training dataset as measured by the model loss before retraining from scratch.
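A minimal sketch of that trimming step, under assumed details (PyTorch, a classification-style loss, a 10% cut): score every training sample with the current model's loss, drop the lowest-loss fraction, and retrain from scratch.
```python
# Sketch: drop the easiest training samples, ranked by per-sample loss.
# model / eval_loader / train_dataset are hypothetical existing objects.
import torch

@torch.no_grad()
def per_sample_losses(model, loader, loss_fn) -> torch.Tensor:
    model.eval()
    losses = []
    for x, y in loader:                      # loader must NOT shuffle here
        losses.append(loss_fn(model(x), y))  # loss_fn uses reduction="none"
    return torch.cat(losses)

losses = per_sample_losses(model, eval_loader,
                           torch.nn.CrossEntropyLoss(reduction="none"))
keep_frac = 0.90                             # assumed cutoff, not the paper's
keep_idx = losses.argsort(descending=True)[: int(keep_frac * len(losses))]
trimmed = torch.utils.data.Subset(train_dataset, keep_idx.tolist())
# ...then retrain the model from scratch on `trimmed`.
```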
arXiv Detail & Related papers (2024-09-20T14:47:58Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
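A hedged sketch of the prioritization idea: keep a running loss estimate per data block and sample the next block in proportion to those losses, so challenging regions are revisited more often. The bookkeeping below is an assumption, not LFR's exact schedule.
```python
# Sketch: revisit data blocks in proportion to their recent training loss.
import random

class BlockScheduler:
    """Tracks a running loss per block and samples hard blocks more often."""

    def __init__(self, num_blocks: int, momentum: float = 0.9):
        self.loss = [1.0] * num_blocks  # optimistic init: every block is "hard"
        self.momentum = momentum

    def next_block(self) -> int:
        # Loss-proportional sampling; a temperature could sharpen/flatten this.
        return random.choices(range(len(self.loss)), weights=self.loss)[0]

    def update(self, block: int, loss: float) -> None:
        m = self.momentum
        self.loss[block] = m * self.loss[block] + (1 - m) * loss

sched = BlockScheduler(num_blocks=1024)
for step in range(num_steps):                  # num_steps: hypothetical
    b = sched.next_block()
    sched.update(b, train_on_block(b))         # train_on_block: hypothetical
```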
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [16.654859430784825]
Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining.
We introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress.
Experiments of pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on extensive downstream tasks.
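A rough sketch of the selection loop, with every helper a hypothetical placeholder: periodically refit a small influence model that predicts each candidate example's utility at the current pretraining stage, then train on the top-scoring examples.
```python
# Sketch: periodically refit a small data-influence model and select with it.
# Every helper function below is a hypothetical placeholder.
import numpy as np

refit_every, batch_size = 1000, 256          # assumed values

for step in range(num_steps):                # num_steps: hypothetical
    if step % refit_every == 0:
        # Probe a small sample: measure each example's effect on held-out
        # loss at the *current* stage, then fit a cheap regressor on that.
        xs, ys = measure_influence(sample_of(candidates))    # hypothetical
        influence_model = fit_regressor(xs, ys)              # hypothetical
    scores = influence_model.predict(featurize(candidates))  # hypothetical
    top = np.argsort(scores)[-batch_size:]   # highest predicted utility
    train_step([candidates[i] for i in top]) # hypothetical
```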
arXiv Detail & Related papers (2024-06-10T06:27:42Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
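As an illustration of the multi-task framing, each task can be cast as text-to-text with a task prefix and mixed into one training set; the prefixes and the alignment-based auxiliary task below are assumptions, not necessarily the paper's exact tasks.
```python
# Sketch: cast the main GEC task and an alignment-based auxiliary task as
# text-to-text examples with task prefixes, then train one seq2seq model
# on the mixture.
def make_examples(src: str, tgt: str) -> list[dict]:
    # Naive positional token alignment, for illustration only; real GEC
    # alignment must also handle insertions and deletions.
    edit_tags = " ".join("KEEP" if a == b else "EDIT"
                         for a, b in zip(src.split(), tgt.split()))
    return [
        {"input": f"correct: {src}", "target": tgt},           # main task
        {"input": f"mark_edits: {src}", "target": edit_tags},  # auxiliary task
    ]

print(make_examples("He go to school", "He goes to school"))
```
The paper's finding that dataset order and even instance order matter means the schedule over these mixed examples is itself a tunable design choice.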
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Using GPT-4 to Augment Unbalanced Data for Automatic Scoring [0.5586073503694489]
We introduce a novel text data augmentation framework leveraging GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses, especially for minority scoring classes.
We fine-tuned DistilBERT for automatic scoring on the augmented and original datasets.
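A minimal sketch of the class-balancing idea, with assumed prompt wording and hypothetical helpers: count examples per score class and ask GPT-4 to synthesize responses only for the underrepresented classes.
```python
# Sketch: generate synthetic responses only for minority scoring classes.
from collections import Counter
from openai import OpenAI

client = OpenAI()
counts = Counter(labels)        # labels: hypothetical, one score class per response
target = max(counts.values())   # balance every class up to the majority count

for cls, n in counts.items():
    for _ in range(target - n):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Write a student response that would be "
                                  f"scored as '{cls}'."}],  # assumed prompt
        )
        add_training_example(resp.choices[0].message.content, cls)  # hypothetical
# The augmented + original data then fine-tune DistilBERT as usual.
```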
arXiv Detail & Related papers (2023-10-25T01:07:50Z)
- Challenging the Myth of Graph Collaborative Filtering: a Reasoned and Reproducibility-driven Analysis [50.972595036856035]
We present code that successfully replicates the results of six popular and recent graph recommendation models.
We compare these graph models with traditional collaborative filtering models that historically performed well in offline evaluations.
By investigating the information flow from users' neighborhoods, we aim to identify which models are influenced by intrinsic features in the dataset structure.
arXiv Detail & Related papers (2023-08-01T09:31:44Z)
- Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks.
The four pre-trained models we used, T5, BART, BERT, and GPT-2, achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z)
- DAGAM: Data Augmentation with Generation And Modification [3.063234089519162]
In pre-trained language models, under-fitting often occurs because the models are very large.
We introduce three data augmentation schemes that help reduce underfitting problems of large-scale language models.
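The summary does not name the three schemes, so the sketch below shows only a generic modification-style augmentation (random word-order perturbation) as a placeholder; it should not be read as DAGAM's actual method.
```python
# Generic modification-style augmentation (placeholder, NOT DAGAM's schemes).
import random

def perturb_word_order(text: str, n_swaps: int = 2, seed: int | None = None) -> str:
    """Swap a few random word pairs; label-preserving for many text tasks."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(perturb_word_order("the quick brown fox jumps over the lazy dog", seed=0))
```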
arXiv Detail & Related papers (2022-04-06T07:20:45Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
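As a hedged sketch of the IRT machinery, a one-parameter (Rasch) model fits one skill value per model and one difficulty per test item from the 0/1 response matrix; the paper's actual IRT variant and fitting procedure may differ.
```python
# Sketch: fit a Rasch (1PL) IRT model to a models-by-items 0/1 response matrix
# with plain gradient ascent on the Bernoulli log-likelihood.
import numpy as np

def fit_rasch(R: np.ndarray, lr: float = 0.05, steps: int = 2000):
    """R[i, j] = 1 if model i answered item j correctly, else 0."""
    n_models, n_items = R.shape
    skill = np.zeros(n_models)       # theta_i, one per model
    diff = np.zeros(n_items)         # b_j, one per test item
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(skill[:, None] - diff[None, :])))
        err = R - p                  # gradient of the log-likelihood wrt logits
        skill += lr * err.sum(axis=1) / n_items
        diff -= lr * err.sum(axis=0) / n_models
        diff -= diff.mean()          # pin the scale's location (identifiability)
    return skill, diff

# Items whose difficulty sits between the strong and weak models' skill
# levels are the ones that best separate state-of-the-art systems.
```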
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Data from Model: Extracting Data from Non-robust and Robust Models [83.60161052867534]
This work explores the reverse process of generating data from a model, attempting to reveal the relationship between the data and the model.
We repeat the process of Data to Model (DtM) and Data from Model (DfM) in sequence and explore the loss of feature mapping information.
Our results show that the accuracy drop is limited even after multiple sequences of DtM and DfM, especially for robust models.
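A rough sketch of one DtM/DfM round under assumed details: train a classifier (DtM), synthesize inputs from it by gradient ascent on class logits as a simple stand-in for the paper's extraction procedure (DfM), then train the next model on the synthesized data.
```python
# Sketch of one Data-to-Model / Data-from-Model round. Gradient-ascent
# "model inversion" here is a simple stand-in for the paper's procedure.
import torch

def data_from_model(model, n_per_class, n_classes, shape, steps=200, lr=0.1):
    model.eval()
    xs, ys = [], []
    for c in range(n_classes):
        x = torch.randn(n_per_class, *shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = -model(x)[:, c].mean()    # push logits for class c up
            loss.backward()
            opt.step()
        xs.append(x.detach())
        ys.append(torch.full((n_per_class,), c))
    return torch.cat(xs), torch.cat(ys)

# DtM -> DfM -> DtM ...: each new model trains only on data extracted from
# the previous one, and test accuracy is tracked across the sequence.
```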
arXiv Detail & Related papers (2020-07-13T05:27:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.