TuringAdvice: A Generative and Dynamic Evaluation of Language Use
- URL: http://arxiv.org/abs/2004.03607v2
- Date: Tue, 13 Apr 2021 01:05:17 GMT
- Title: TuringAdvice: A Generative and Dynamic Evaluation of Language Use
- Authors: Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali
Farhadi, Yejin Choi
- Abstract summary: We propose TuringAdvice, a new challenge task and dataset for language understanding models.
Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language.
Empirical results show that today's models struggle at TuringAdvice.
- Score: 90.3029315711237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose TuringAdvice, a new challenge task and dataset for language
understanding models. Given a written situation that a real person is currently
facing, a model must generate helpful advice in natural language. Our
evaluation framework tests a fundamental aspect of human language
understanding: our ability to use language to resolve open-ended situations by
communicating with each other.
Empirical results show that today's models struggle at TuringAdvice, even
multibillion parameter models finetuned on 600k in-domain training examples.
The best model, a finetuned T5, writes advice that is at least as helpful as
human-written advice in only 14% of cases; a much larger non-finetunable GPT3
model does even worse at 4%. This low performance reveals language
understanding errors that are hard to spot outside of a generative setting,
showing much room for progress.
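As a concrete illustration of the task format (a written situation in, free-form advice out), here is a minimal sketch using the Hugging Face transformers library. The t5-base checkpoint, the "advice:" prefix, and the decoding settings are stand-in assumptions for illustration only; they are not the paper's finetuned model or its actual generation setup.

```python
# Minimal sketch of the TuringAdvice task format: situation in, advice out.
# Assumptions: "t5-base" is only a stand-in for the paper's finetuned T5,
# and the "advice:" prefix plus decoding settings are illustrative choices.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

situation = (
    "My roommate keeps borrowing my laptop without asking and returns it "
    "with the battery drained. How do I bring this up without starting a fight?"
)

# Encode the situation and generate free-form advice.
inputs = tokenizer("advice: " + situation, return_tensors="pt", truncation=True)
output_ids = model.generate(
    **inputs,
    max_length=128,
    num_beams=4,
    no_repeat_ngram_size=3,
)
advice = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(advice)
```

In the evaluation described in the abstract, advice generated this way is judged by people against human-written advice for the same situation; the 14% and 4% figures above are the fractions of cases in which the model's advice was rated at least as helpful.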
Related papers
- What Makes Language Models Good-enough? [11.763229353978321] (2024-06-06)
  Psycholinguistic research suggests that humans may build a representation of linguistic input that is 'good-enough' for the task at hand.
  This study examines what architectural features make language models learn human-like good-enough language processing.
- Robustifying Language Models with Test-Time Adaptation [17.96043752001886] (2023-10-29)
  Large-scale language models achieved state-of-the-art performance over a number of language tasks.
  They fail on adversarial language examples, which are sentences optimized to fool the language models but with similar semantic meanings for humans.
  We show that we can reverse many language adversarial attacks by adapting the input sentence with predictions from masked words (a generic illustration appears after this list).
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472] (2023-02-06)
  We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
  We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model (see the sketch after this list).
  By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
- Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445] (2022-09-02)
  In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
  We finetune smaller language models to generate useful intermediate context, referred to here as elaborations.
  Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
- Training Language Models with Natural Language Feedback [51.36137482891037] (2022-04-29)
  We learn from language feedback on model outputs using a three-step learning algorithm.
  In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
  Using only 100 samples of human-written feedback, our learning algorithm finetunes a GPT-3 model to roughly human-level summarization.
- Training language models to follow instructions with human feedback [29.590666996229206] (2022-03-04)
  We show an avenue for aligning language models with user intent by fine-tuning with human feedback.
  The resulting InstructGPT models show improvements in truthfulness and reductions in toxic output generation.
- Few-shot Learning with Multilingual Language Models [66.49496434282564] (2021-12-20)
  We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
  Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
  We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
- Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm [0.0] (2021-02-15)
  We discuss methods of prompt programming, emphasizing the usefulness of considering prompts through the lens of natural language.
  We introduce the idea of a metaprompt that seeds the model to generate its own natural language prompts for a range of tasks.
- Language Models are Few-Shot Learners [61.36677350504291] (2020-05-28)
  We show that scaling up language models greatly improves task-agnostic, few-shot performance.
  We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
  GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
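For the "Robustifying Language Models with Test-Time Adaptation" entry above, the core idea of repairing a perturbed input with masked-word predictions can be illustrated with an off-the-shelf fill-mask model. Everything concrete below (model choice, example sentence, which word gets masked) is an assumption for illustration, not that paper's actual procedure.

```python
# Toy illustration of reversing an adversarial word swap with a masked-word
# prediction. The sentence, the typo, and the single-mask heuristic are all
# assumptions for illustration; the paper's actual adaptation may differ.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "the movie was surprisingly enjoyible"  # hypothetical adversarial misspelling
masked = sentence.replace("enjoyible", fill_mask.tokenizer.mask_token)
best = fill_mask(masked, top_k=1)[0]
print(best["sequence"])  # the sentence with the masked word filled in by the model
```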
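For the "Chain of Hindsight" entry above, the data-formatting idea of turning feedback of either polarity into plain training sentences can be sketched generically. The template strings and example records below are assumptions for illustration, not that paper's exact format.

```python
# Generic sketch of packing feedback into fine-tuning text, in the spirit of
# Chain of Hindsight. Templates and example records are illustrative only.
def to_training_sequence(prompt, good_answer, bad_answer):
    """Combine a preferred and a dispreferred answer, each introduced by a
    hindsight feedback phrase, into a single training string."""
    return (
        f"{prompt}\n"
        f"A helpful answer: {good_answer}\n"
        f"An unhelpful answer: {bad_answer}\n"
    )

records = [
    {
        "prompt": "How do I ask my roommate to stop borrowing my laptop?",
        "good": "Pick a calm moment, explain the problem, and agree on a rule.",
        "bad": "Hide the laptop and never mention it.",
    },
]

# Each resulting string would then serve as ordinary language-modeling training data.
for r in records:
    print(to_training_sequence(r["prompt"], r["good"], r["bad"]))
```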
This list is automatically generated from the titles and abstracts of the papers listed on this site.