CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain
Performance and Calibration
- URL: http://arxiv.org/abs/2309.07822v3
- Date: Tue, 13 Feb 2024 10:52:52 GMT
- Title: CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain
Performance and Calibration
- Authors: Rachneet Sachdeva, Martin Tutek, Iryna Gurevych
- Abstract summary: We show that data augmentation consistently enhances OOD performance.
We also show that CF augmented models which are easier to calibrate also exhibit much lower entropy when assigning importance.
- Score: 59.48235003469116
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In recent years, large language models (LLMs) have shown remarkable
capabilities at scale, particularly at generating text conditioned on a prompt.
In our work, we investigate the use of LLMs to augment training data of small
language models (SLMs) with automatically generated counterfactual (CF)
instances -- i.e. minimally altered inputs -- in order to improve
out-of-domain (OOD) performance of SLMs in the extractive question
answering (QA) setup. We show that, across various LLM generators, such data
augmentation consistently enhances OOD performance and improves model
calibration for both confidence-based and rationale-augmented calibrator
models. Furthermore, these performance improvements correlate with higher
diversity of CF instances in terms of their surface form and semantic content.
Finally, we show that CF augmented models which are easier to calibrate also
exhibit much lower entropy when assigning importance, indicating that
rationale-augmented calibrators prefer concise explanations.
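The abstract's last claim links lower entropy of importance scores to more concise explanations. As a rough illustration only, the sketch below computes the Shannon entropy of a set of token-importance scores after normalizing them to sum to one. The function name `importance_entropy` and the example scores are hypothetical, and the paper's actual attribution method and entropy definition are not given in this listing, so this should not be read as the authors' procedure.

```python
import math


def importance_entropy(scores):
    """Shannon entropy (in nats) of non-negative importance scores.

    Assumption: scores are normalized to sum to 1 before the entropy
    is taken; the paper may normalize differently.
    """
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    probs = [s / total for s in scores]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical importance scores over a 4-token input.
concise = [0.97, 0.01, 0.01, 0.01]  # importance concentrated on one token
diffuse = [0.25, 0.25, 0.25, 0.25]  # importance spread evenly

print(importance_entropy(concise))
print(importance_entropy(diffuse))
```

Under this definition, the concentrated scores give a lower entropy than the uniform ones, whose entropy equals the maximum possible for four tokens (the natural log of 4). That is the sense in which low entropy corresponds to a concise explanation.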
Related papers
- ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model [9.1108256816605]
We propose a method to improve model representation and processing efficiency by replacing the tokenizers of large language models (LLMs).
Our method can maintain the performance of the model after replacing the tokenizer, while significantly improving the decoding speed for long texts.
arXiv Detail & Related papers (2024-10-06T03:01:07Z)
- Structuring a Training Strategy to Robustify Perception Models with Realistic Image Augmentations [1.5723316845301678]
This report introduces a novel methodology for training with augmentations to enhance model robustness and performance in such conditions.
We present a comprehensive framework that includes identifying weak spots in Machine Learning models, selecting suitable augmentations, and devising effective training strategies.
Experimental results demonstrate improvements in model performance, as measured by commonly used metrics such as mean Average Precision (mAP) and mean Intersection over Union (mIoU) on open-source object detection and semantic segmentation models and datasets.
arXiv Detail & Related papers (2024-08-30T14:15:48Z)
- Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models [0.8399688944263842]
Large Language Models (LLMs) have the capability to understand and generate human-like text from input queries.
This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines.
We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding.
arXiv Detail & Related papers (2024-06-17T04:35:17Z)
- Effective internal language model training and fusion for factorized transducer model [26.371223360905557]
Internal language model (ILM) of the neural transducer has been widely studied.
We propose a novel ILM training and decoding strategy for factorized transducer models.
arXiv Detail & Related papers (2024-04-02T08:01:05Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models [23.881825575095945]
Large pre-trained language models (PLMs) have demonstrated strong performance on natural language understanding (NLU) tasks through fine-tuning.
However, fine-tuned models still suffer from overconfident predictions, especially in out-of-domain settings.
We demonstrate that the PLMs are well-calibrated on the masked language modeling task with robust predictive confidence under domain shift.
We show that preserving pre-trained features can improve the calibration of fine-tuned language models.
arXiv Detail & Related papers (2023-05-30T17:35:31Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Meta-Learning Fast Weight Language Models [105.66999854213724]
We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently.
FWLs can be applied at training time so the model learns to make good use of gradient updates.
arXiv Detail & Related papers (2022-12-05T18:37:09Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.