Explanation-based Finetuning Makes Models More Robust to Spurious Cues
- URL: http://arxiv.org/abs/2305.04990v3
- Date: Tue, 6 Jun 2023 15:31:33 GMT
- Title: Explanation-based Finetuning Makes Models More Robust to Spurious Cues
- Authors: Josh Magnus Ludan, Yixuan Meng, Tai Nguyen, Saurabh Shah, Qing Lyu,
Marianna Apidianaki, Chris Callison-Burch
- Abstract summary: Large Language Models (LLMs) are so powerful that they sometimes learn correlations between labels and features that are irrelevant to the task.
We propose explanation-based finetuning as a general approach to mitigate LLMs' reliance on spurious correlations.
We finetune the model to additionally generate a free-text explanation supporting its answer.
- Score: 21.327036110196637
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) are so powerful that they sometimes learn
correlations between labels and features that are irrelevant to the task,
leading to poor generalization on out-of-distribution data. We propose
explanation-based finetuning as a general approach to mitigate LLMs' reliance
on spurious correlations. Unlike standard finetuning where the model only
predicts the answer given the input, we finetune the model to additionally
generate a free-text explanation supporting its answer. To evaluate our method,
we finetune the model on artificially constructed training sets containing
different types of spurious cues, and test it on a test set without these cues.
Compared to standard finetuning, our method makes GPT-3 (davinci) remarkably
more robust against spurious cues in terms of accuracy drop across four
classification tasks: ComVE (+1.2), CREAK (+9.1), e-SNLI (+15.4), and SBIC
(+6.5). The efficacy generalizes across multiple model families and scales,
with greater gains for larger models. Finally, our method also works well with
explanations generated by the model, implying its applicability to more
datasets without human-written explanations.
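To make the contrast concrete, below is a minimal Python sketch of how training pairs might be formatted for standard versus explanation-based finetuning. The function names, prompt template, and ComVE-style example are illustrative assumptions, not the paper's actual preprocessing code; the key point is that the completion target carries a free-text explanation in addition to the label.

```python
# Hypothetical formatting helpers; field names and templates are
# illustrative, not the paper's released code.

def format_standard(example: dict) -> dict:
    """Standard finetuning: the completion is the label alone."""
    return {
        "prompt": f"{example['input']}\nAnswer:",
        "completion": f" {example['label']}",
    }

def format_with_explanation(example: dict) -> dict:
    """Explanation-based finetuning: the completion is the label
    followed by a free-text explanation supporting it."""
    return {
        "prompt": f"{example['input']}\nAnswer:",
        "completion": (
            f" {example['label']}\nExplanation: {example['explanation']}"
        ),
    }

example = {  # ComVE-style instance, invented for illustration
    "input": "Statement: He put an elephant into the fridge.",
    "label": "Against common sense",
    "explanation": "An elephant is far too large to fit inside a fridge.",
}
print(format_with_explanation(example)["completion"])
```

Because the explanation must reference task-relevant content, the model is pushed to attend to the input itself rather than to a spurious surface cue.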
Related papers
- Take It Easy: Label-Adaptive Self-Rationalization for Fact Verification and Explanation Generation [15.94564349084642]
The self-rationalization method is typically used in natural language inference tasks.
We fine-tune a model to learn veracity prediction with annotated labels.
We generate synthetic explanations from three large language models.
arXiv Detail & Related papers (2024-10-05T02:19:49Z)
- Model Editing with Canonical Examples [75.33218320106585]
We introduce model editing with canonical examples.
A canonical example is a simple instance of good behavior, e.g., "The capital of Mauritius is Port Louis."
We propose sense finetuning, which selects and finetunes a few sense vectors for each canonical example.
arXiv Detail & Related papers (2024-02-09T03:08:12Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
- Text Alignment Is An Efficient Unified Model for Massive NLP Tasks [24.069447197357164]
Next-word prediction is often not an efficient formulation for many NLP tasks.
We propose text alignment as an efficient unified model for a wide range of crucial tasks.
Our model delivers on par or even superior performance with much smaller model sizes.
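As a rough illustration of the alignment formulation summarized above, the sketch below reduces several tasks to one pairwise scoring call. The token-overlap scorer is a toy stand-in for the paper's learned alignment model; the reduction pattern, not the scorer, is the point.

```python
# Toy sketch: casting several NLP tasks as one pairwise alignment query.
# alignment_score is a crude lexical proxy, not the paper's model.

def alignment_score(text_a: str, text_b: str) -> float:
    """Toy proxy for P(text_b is supported by text_a)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(b) if b else 0.0

# Each task reduces to the same (context, claim) alignment call.
def nli(premise: str, hypothesis: str) -> float:
    return alignment_score(premise, hypothesis)

def fact_verification(evidence: str, claim: str) -> float:
    return alignment_score(evidence, claim)

def qa_validation(passage: str, question: str, answer: str) -> float:
    return alignment_score(passage, f"{question} {answer}")

print(nli("The cat sat on the mat.", "A cat is on the mat."))
```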
arXiv Detail & Related papers (2023-07-06T02:28:31Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing [38.770055054268965]
Recent work has shown considerable improvements on many NLP tasks from model scaling.
Fine-tuning generally has flat or negative scaling curves on out-of-distribution compositional generalization.
In-context learning has positive scaling curves, but is generally outperformed by much smaller fine-tuned models.
arXiv Detail & Related papers (2022-05-24T17:57:39Z)
- Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
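For readers unfamiliar with the mechanism being scaled, a minimal NumPy sketch of conditional computation in an MoE layer follows. It uses top-1 routing and omits the load balancing, capacity limits, and distributed expert placement that real systems rely on; all sizes below are arbitrary.

```python
# Minimal MoE layer with top-1 routing: each token runs through only
# the expert its router picks, which is the conditional-computation idea.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, d_ff = 16, 4, 32

# Each expert is a small two-layer feed-forward network.
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]
router = rng.normal(size=(d_model, n_experts))  # gating weights

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    logits = tokens @ router                      # (n_tokens, n_experts)
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    choice = gates.argmax(-1)                     # top-1 expert per token
    out = np.zeros_like(tokens)
    for e, (w1, w2) in enumerate(experts):
        mask = choice == e
        if mask.any():                            # only routed tokens run expert e
            h = np.maximum(tokens[mask] @ w1, 0.0)
            out[mask] = gates[mask, e:e + 1] * (h @ w2)
    return out

print(moe_layer(rng.normal(size=(8, d_model))).shape)  # (8, 16)
```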
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills [32.55545292360155]
We propose to leverage semi-structured tables, and automatically generate at scale question-paragraph pairs.
We add a pre-training step over this synthetic data, which includes examples that require 16 different reasoning skills.
We show that our model, PReasM, substantially outperforms T5, a popular pre-trained encoder-decoder model.
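A toy sketch of the generation idea: templates applied over a semi-structured table yield synthetic question-paragraph pairs, each exercising a reasoning skill. The table, templates, and fields below are invented for illustration and are not PReasM's actual pipeline.

```python
# Hypothetical table-to-question templates; fields are illustrative.
table = {
    "title": "Olympic medals",
    "rows": [
        {"country": "Norway", "gold": 16, "silver": 8},
        {"country": "Germany", "gold": 12, "silver": 10},
    ],
}

def generate_pairs(table: dict) -> list[dict]:
    # Verbalize the table rows into a synthetic paragraph.
    paragraph = " ".join(
        f"{r['country']} won {r['gold']} gold and {r['silver']} silver medals."
        for r in table["rows"]
    )
    pairs = [
        {
            "question": f"How many gold medals did {r['country']} win?",
            "paragraph": paragraph,
            "answer": str(r["gold"]),
        }
        for r in table["rows"]
    ]
    # A comparison question exercises a different reasoning skill.
    top = max(table["rows"], key=lambda r: r["gold"])
    pairs.append({
        "question": "Which country won the most gold medals?",
        "paragraph": paragraph,
        "answer": top["country"],
    })
    return pairs

for p in generate_pairs(table):
    print(p["question"], "->", p["answer"])
```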
arXiv Detail & Related papers (2021-07-15T11:37:14Z)
- Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension [27.538957000237176]
Humans create questions adversarially, such that the model fails to answer them correctly.
We collect 36,000 samples with progressively stronger models in the annotation loop.
We find that training on adversarially collected samples leads to strong generalisation to non-adversarially collected datasets.
We find that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.
arXiv Detail & Related papers (2020-02-02T00:22:55Z)
- AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of which utterances or tokens are dull, without any feature engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
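A hedged sketch of the measure these models share: averaging the decoder's output distributions over a batch yields per-token "dullness", and one minus the average dull mass of a response can serve as a diversity score. The exact scoring details in the paper differ; this is only the shape of the idea.

```python
# Toy AvgOut-style diversity score; scoring details are an assumption.
import numpy as np

def avgout(batch_dists: np.ndarray) -> np.ndarray:
    """batch_dists: (n_steps, vocab) softmax outputs pooled over a batch.
    Returns the average probability per vocabulary token."""
    return batch_dists.mean(axis=0)

def diversity_score(response_ids: list[int], avg: np.ndarray) -> float:
    # Responses built from frequently predicted (dull) tokens score low.
    return float(1.0 - avg[response_ids].mean())

rng = np.random.default_rng(0)
dists = rng.dirichlet(np.ones(100), size=50)  # fake output distributions
avg = avgout(dists)
print(diversity_score([3, 17, 42], avg))
```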
arXiv Detail & Related papers (2020-01-15T18:32:06Z)