Related papers: Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks

Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks

URL: http://arxiv.org/abs/2205.05718v1
Date: Wed, 11 May 2022 18:14:33 GMT
Title: Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks
Authors: Katherine M. Collins, Catherine Wong, Jiahai Feng, Megan Wei, and Joshua B. Tenenbaum
Abstract summary: We ask how much of human-like thinking can be captured by learning statistical patterns in language alone. Our benchmark contains two problem-solving domains (planning and explanation generation) and is designed to require generalization. We find that humans are far more robust than LLMs on this benchmark.
Score: 39.39138995087475
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human language offers a powerful window into our thoughts -- we tell stories, give explanations, and express our beliefs and goals through words. Abundant evidence also suggests that language plays a developmental role in structuring our learning. Here, we ask: how much of human-like thinking can be captured by learning statistical patterns in language alone? We first contribute a new challenge benchmark for comparing humans and distributional large language models (LLMs). Our benchmark contains two problem-solving domains (planning and explanation generation) and is designed to require generalization to new, out-of-distribution problems expressed in language. We find that humans are far more robust than LLMs on this benchmark. Next, we propose a hybrid Parse-and-Solve model, which augments distributional LLMs with a structured symbolic reasoning module. We find that this model shows more robust adaptation to out-of-distribution planning problems, demonstrating the promise of hybrid AI models for more human-like reasoning.

Related papers

Bayesian Social Deduction with Graph-Informed Language Models [3.7540464038118633]
Social reasoning remains a challenging task for large language models.<n>We introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model.<n>Our approach achieves competitive performance with much larger models in Agent-Agent play.
arXiv Detail & Related papers (2025-06-21T18:45:28Z)
The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs [54.59207567677249]
Large language models (LLMs) still struggle across tasks outside of high-resource languages.<n>In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce.
arXiv Detail & Related papers (2025-05-23T20:28:31Z)
On the Thinking-Language Modeling Gap in Large Language Models [68.83670974539108]
We show that there is a significant gap between the modeling of languages and thoughts.<n>We propose a new prompt technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap.
arXiv Detail & Related papers (2025-05-19T09:31:52Z)
Negation: A Pink Elephant in the Large Language Models' Room? [2.8078480738404]
Negations are key to determining sentence meaning, making them essential for logical reasoning. We investigate how model size and language impact its ability to handle negation correctly by evaluating popular language models. Our datasets can facilitate further research and improvements of language model reasoning in multilingual settings.
arXiv Detail & Related papers (2025-03-28T13:04:41Z)
Non-literal Understanding of Number Words by Language Models [33.24263583093367]
Humans naturally interpret numbers non-literally, combining context, world knowledge, and speaker intent. We investigate whether large language models (LLMs) interpret numbers similarly, focusing on hyperbole and pragmatic halo effects.
arXiv Detail & Related papers (2025-02-10T07:03:00Z)
From Babbling to Fluency: Evaluating the Evolution of Language Models in Terms of Human Language Acquisition [6.617999710257379]
We propose a three-stage framework to assess the abilities of LMs. We evaluate the generative capacities of LMs using methods from linguistic research.
arXiv Detail & Related papers (2024-10-17T06:31:49Z)
A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds? [2.7342737448775534]
Large Language Models (LLMs) have been linked to claims about human-like linguistic performance. We analyze the contribution of LLMs as theoretically informative representations of a target cognitive system. We evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing.
arXiv Detail & Related papers (2023-07-26T18:58:53Z)
From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought [124.40905824051079]
We propose rational meaning construction, a computational framework for language-informed thinking. We frame linguistic meaning as a context-sensitive mapping from natural language into a probabilistic language of thought. We show that LLMs can generate context-sensitive translations that capture pragmatically-appropriate linguistic meanings. We extend our framework to integrate cognitively-motivated symbolic modules.
arXiv Detail & Related papers (2023-06-22T05:14:00Z)
A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models [29.993190226231793]
We use chain-of-thought prompts to introduce structures from probabilistic models into large language models. Our prompts lead language models to infer latent variables and reason about their relationships in order to choose appropriate paraphrases for metaphors.
arXiv Detail & Related papers (2022-09-16T19:23:13Z)
Chain of Thought Prompting Elicits Reasoning in Large Language Models [56.811278668446825]
This paper explores the ability of language models to generate a coherent chain of thought. Experiments show that inducing a chain of thought via prompting can enable sufficiently large language models to better perform reasoning tasks.
arXiv Detail & Related papers (2022-01-28T02:33:07Z)
Few-Shot Self-Rationalization with Natural Language Prompts [29.23404535276466]
Self-rationalization models that predict task labels generate free-text elaborations for their predictions. These models are, however, currently trained with a large amount of human-written free-text explanations for each task. We propose to study a more realistic setting of self-rationalization using few training examples.
arXiv Detail & Related papers (2021-11-16T08:21:40Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.