Structured, flexible, and robust: benchmarking and improving large
language models towards more human-like behavior in out-of-distribution
reasoning tasks
- URL: http://arxiv.org/abs/2205.05718v1
- Date: Wed, 11 May 2022 18:14:33 GMT
- Title: Structured, flexible, and robust: benchmarking and improving large
language models towards more human-like behavior in out-of-distribution
reasoning tasks
- Authors: Katherine M. Collins, Catherine Wong, Jiahai Feng, Megan Wei, and
Joshua B. Tenenbaum
- Abstract summary: We ask how much of human-like thinking can be captured by learning statistical patterns in language alone.
Our benchmark contains two problem-solving domains (planning and explanation generation) and is designed to require generalization.
We find that humans are far more robust than LLMs on this benchmark.
- Score: 39.39138995087475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human language offers a powerful window into our thoughts -- we tell stories,
give explanations, and express our beliefs and goals through words. Abundant
evidence also suggests that language plays a developmental role in structuring
our learning. Here, we ask: how much of human-like thinking can be captured by
learning statistical patterns in language alone? We first contribute a new
challenge benchmark for comparing humans and distributional large language
models (LLMs). Our benchmark contains two problem-solving domains (planning and
explanation generation) and is designed to require generalization to new,
out-of-distribution problems expressed in language. We find that humans are far
more robust than LLMs on this benchmark. Next, we propose a hybrid
Parse-and-Solve model, which augments distributional LLMs with a structured
symbolic reasoning module. We find that this model shows more robust adaptation
to out-of-distribution planning problems, demonstrating the promise of hybrid
AI models for more human-like reasoning.
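The abstract does not spell out the Parse-and-Solve model in code. As a rough illustration of the idea (not the authors' implementation), the sketch below assumes a hypothetical llm_parse step, hard-coded here instead of actually prompting an LLM, that maps a natural-language planning problem to a symbolic task, plus a toy breadth-first planner standing in for the structured symbolic reasoning module:

```python
from collections import deque

def llm_parse(problem_text: str) -> dict:
    """Stand-in for the LLM parsing step (hard-coded here; the hybrid model
    would prompt a language model to produce this symbolic task)."""
    # e.g. "Walk from the door at (0, 0) to the desk at (2, 2); a chair blocks (1, 1)."
    return {"start": (0, 0), "goal": (2, 2), "blocked": {(1, 1)}, "size": 4}

def symbolic_solve(task: dict):
    """Structured solver: plain breadth-first search over grid moves."""
    start, goal = task["start"], task["goal"]
    blocked, n = task["blocked"], task["size"]
    frontier, parent = deque([start]), {start: None}
    while frontier:
        cell = frontier.popleft()
        if cell == goal:  # walk back through parents to recover the plan
            plan = []
            while cell is not None:
                plan.append(cell)
                cell = parent[cell]
            return plan[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < n and 0 <= nxt[1] < n
                    and nxt not in blocked and nxt not in parent):
                parent[nxt] = cell
                frontier.append(nxt)
    return None  # no plan exists

task = llm_parse("Walk from the door to the desk; a chair is in the way.")
print(symbolic_solve(task))  # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```

The division of labor is the point of the hybrid design: the LLM only has to translate language into a symbolic problem, while the search procedure supplies the systematic planning that stays robust on out-of-distribution inputs.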
Related papers
- From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition [6.617999710257379]
We propose a three-stage framework to assess the abilities of LMs.
We evaluate the generative capacities of LMs using methods from linguistic research.
arXiv Detail & Related papers (2024-10-17T06:31:49Z)
- A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds? [2.7342737448775534]
Large Language Models (LLMs) have been linked to claims about human-like linguistic performance.
We analyze the contribution of LLMs as theoretically informative representations of a target cognitive system.
We evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing.
arXiv Detail & Related papers (2023-07-26T18:58:53Z)
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought [124.40905824051079]
We propose rational meaning construction, a computational framework for language-informed thinking.
We frame linguistic meaning as a context-sensitive mapping from natural language into a probabilistic language of thought.
We show that LLMs can generate context-sensitive translations that capture pragmatically-appropriate linguistic meanings.
We extend our framework to integrate cognitively-motivated symbolic modules.
arXiv Detail & Related papers (2023-06-22T05:14:00Z)
- A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To mark the difference in parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
- Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models [29.993190226231793]
We use chain-of-thought prompts to introduce structures from probabilistic models into large language models.
Our prompts lead language models to infer latent variables and reason about their relationships in order to choose appropriate paraphrases for metaphors.
arXiv Detail & Related papers (2022-09-16T19:23:13Z)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought.
Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
- Few-Shot Self-Rationalization with Natural Language Prompts [29.23404535276466]
Self-rationalization models predict task labels and generate free-text elaborations for their predictions.
These models are, however, currently trained with a large number of human-written free-text explanations for each task.
We propose to study a more realistic setting of self-rationalization using few training examples.
arXiv Detail & Related papers (2021-11-16T08:21:40Z)
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
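The AdvGLUE entry above builds its benchmark by applying adversarial attack methods to GLUE examples and then having humans validate the perturbations. As a loose illustration of the flavor of a word-level attack (not one of the 14 methods actually used, and with an invented synonym table and keyword "classifier" standing in for a fine-tuned model):

```python
# Toy word-substitution perturbation, illustration only: the synonym table and
# the keyword "classifier" are invented stand-ins, not AdvGLUE's attacks/models.
SYNONYMS = {"great": "first-rate", "boring": "tedious", "terrible": "dreadful"}

def toy_classifier(sentence: str) -> str:
    """Hypothetical victim model: a keyword rule standing in for a fine-tuned LM."""
    return "positive" if "great" in sentence.lower() else "negative"

def word_substitution_attack(sentence: str) -> str:
    """Swap words for near-synonyms; the hope is that the model's prediction
    flips while a human would still assign the original label."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in sentence.split())

original = "A great and moving film"
perturbed = word_substitution_attack(original)
print(toy_classifier(original), "->", toy_classifier(perturbed))  # positive -> negative
```

A real attack searches for perturbations that flip the victim model's prediction while leaving the human-judged label unchanged, which is why AdvGLUE adds the human validation step.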
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.