Meta-tuning Language Models to Answer Prompts Better
- URL: http://arxiv.org/abs/2104.04670v1
- Date: Sat, 10 Apr 2021 02:57:22 GMT
- Title: Meta-tuning Language Models to Answer Prompts Better
- Authors: Ruiqi Zhong, Kristy Lee, Zheng Zhang, Dan Klein
- Abstract summary: Large pretrained language models like GPT-3 have acquired a surprising ability to perform zero-shot classification (ZSC).
We propose meta-tuning, which trains the model to specialize in answering prompts but still generalize to unseen tasks.
After meta-tuning, our model outperforms a same-sized QA model for most labels on unseen tasks.
- Score: 35.71265221884353
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large pretrained language models like GPT-3 have acquired a surprising
ability to perform zero-shot classification (ZSC). For example, to classify
review sentiments, we can "prompt" the language model with the review and the
question "Is the review positive?" as the context, and ask it to predict
whether the next word is "Yes" or "No". However, these models are not
specialized for answering these prompts. To address this weakness, we propose
meta-tuning, which trains the model to specialize in answering prompts but
still generalize to unseen tasks. To create the training data, we aggregated 43
existing datasets, annotated 441 label descriptions in total, and unified them
into the above question answering (QA) format. After meta-tuning, our model
outperforms a same-sized QA model for most labels on unseen tasks, and we
forecast that the performance would improve for even larger models. Therefore,
measuring ZSC performance on non-specialized language models might
underestimate their true capability, and community-wide efforts on aggregating
datasets and unifying their formats can help build models that understand
prompts better.
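As a rough illustration of the yes/no prompting scheme the abstract describes, the sketch below scores "Yes" versus "No" as the next token after a review-plus-question prompt with an off-the-shelf causal language model. This is a minimal sketch assuming the Hugging Face transformers library; gpt2 and the exact prompt template are stand-ins, not the models or wording used in the paper.

```python
# Minimal zero-shot classification sketch: compare the probability the model
# assigns to " Yes" vs. " No" as the next token after the prompt.
# Assumes Hugging Face transformers; gpt2 is a stand-in model, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def zero_shot_classify(review: str, question: str = "Is the review positive?") -> str:
    prompt = f"{review}\n{question}\n"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]  # distribution over the next token
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode(" Yes")[0]
    no_id = tokenizer.encode(" No")[0]
    return "Yes" if log_probs[yes_id] > log_probs[no_id] else "No"

print(zero_shot_classify("The plot was dull and the acting was worse."))
```

Meta-tuning, as described in the abstract, would fine-tune such a model on many (context, question, Yes/No) triples converted from existing classification datasets and their label descriptions, then apply the same scoring to prompts from unseen tasks.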
Related papers
- Predicting the Performance of Black-box LLMs through Self-Queries [60.87193950962585]
As large language models (LLMs) are increasingly relied on in AI systems, predicting when they make mistakes is crucial.
In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations.
We demonstrate that training a linear model on these low-dimensional representations produces reliable predictors of model performance at the instance level; a minimal sketch of this probing setup appears after this list.
arXiv Detail & Related papers (2025-01-02T22:26:54Z)
- The Art of Saying No: Contextual Noncompliance in Language Models [123.383993700586]
We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests.
Our taxonomy spans a wide range of categories including incomplete, unsupported, indeterminate, and humanizing requests.
To test noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts.
arXiv Detail & Related papers (2024-07-02T07:12:51Z)
- Prompting-based Synthetic Data Generation for Few-Shot Question Answering [23.97949073816028]
We show that using large language models can improve Question Answering performance on various datasets in the few-shot setting.
We suggest that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme.
arXiv Detail & Related papers (2024-05-15T13:36:43Z)
- Sign of the Times: Evaluating the use of Large Language Models for Idiomaticity Detection [2.2724928083094196]
This work looks at the performance of a range of LLMs on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE.
We find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales.
arXiv Detail & Related papers (2024-05-15T11:55:14Z)
- Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering [26.34649731975005]
Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA).
While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics unreliable for accurately quantifying model performance.
We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness) and 2) whether they produce a response based on the provided knowledge (faithfulness).
arXiv Detail & Related papers (2023-07-31T17:41:00Z)
- AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
- Generative Language Models for Paragraph-Level Question Generation [79.31199020420827]
Powerful generative models have led to recent progress in question generation (QG).
It is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches.
We introduce QG-Bench, a benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting.
arXiv Detail & Related papers (2022-10-08T10:24:39Z)
- NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks [18.13793282306575]
Natural language explanation (NLE) models aim at explaining the decision-making process of a black box system.
We introduce NLX-GPT, a general, compact and faithful language model that can simultaneously predict an answer and explain it.
We then address the problem of evaluating the explanations, which are often generic, data-biased, and can come in several forms.
arXiv Detail & Related papers (2022-03-09T22:57:15Z)
- Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills [32.55545292360155]
We propose to leverage semi-structured tables, and automatically generate at scale question-paragraph pairs.
We add a pre-training step over this synthetic data, which includes examples that require 16 different reasoning skills.
We show that our model, PReasM, substantially outperforms T5, a popular pre-trained encoder-decoder model.
arXiv Detail & Related papers (2021-07-15T11:37:14Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z)
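As a rough companion to the black-box performance-prediction entry at the top of this list, the sketch below fits a linear probe on low-dimensional features. It is a minimal sketch under stated assumptions: the features here are randomly simulated placeholders for the response probabilities that the paper extracts with follow-up prompts, and none of the names below come from that paper's code.

```python
# Minimal linear-probe sketch: the features stand in for the probabilities a
# black-box LLM assigns to different responses to follow-up prompts; the label
# says whether the model answered the original instance correctly.
# Both features and labels are simulated here; the real ones would come from API queries.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_instances, n_follow_ups = 200, 4
X = rng.random((n_instances, n_follow_ups))   # simulated response-probability features
y = (X.mean(axis=1) > 0.5).astype(int)        # simulated "answered correctly" labels

probe = LogisticRegression().fit(X, y)        # linear model on the representations
print(probe.predict_proba(X[:3])[:, 1])       # instance-level correctness estimates
```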