DREAM: Uncovering Mental Models behind Language Models
- URL: http://arxiv.org/abs/2112.08656v1
- Date: Thu, 16 Dec 2021 06:22:47 GMT
- Title: DREAM: Uncovering Mental Models behind Language Models
- Authors: Yuling Gu, Bhavana Dalvi Mishra, Peter Clark
- Abstract summary: DREAM is a model that takes a situational question as input to produce a mental model elaborating the situation.
It inherits its social commonsense through distant supervision from existing NLP resources.
Mental models generated by DREAM can be used as additional context for situational QA tasks.
- Score: 15.71233907204059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To what extent do language models (LMs) build "mental models" of a scene when
answering situated questions (e.g., questions about a specific ethical
dilemma)? While cognitive science has shown that mental models play a
fundamental role in human problem-solving, it is unclear whether the high
question-answering performance of existing LMs is backed by similar model
building - and if not, whether that can explain their well-known catastrophic
failures. We observed that Macaw, an existing T5-based LM, when probed, provides
somewhat useful but inadequate mental models for situational questions
(estimated accuracy=43%, usefulness=21%, consistency=42%). We propose DREAM, a
model that takes a situational question as input to produce a mental model
elaborating the situation, without any additional task specific training data
for mental models. It inherits its social commonsense through distant
supervision from existing NLP resources. Our analysis shows that DREAM can
produce significantly better mental models (estimated accuracy=67%,
usefulness=37%, consistency=71%) compared to Macaw. Finally, mental models
generated by DREAM can be used as additional context for situational QA tasks.
This additional context improves the answer accuracy of a Macaw zero-shot model
by between +1% and +4% (absolute) on three different datasets.
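The pipeline is straightforward to sketch: generate a mental model from the situational question, then feed that elaboration back to the QA model as additional context. Below is a minimal sketch of that two-step loop using Hugging Face transformers; the checkpoint names and the slot-style prompt format are assumptions for illustration, not details confirmed by the paper.

```python
# Minimal sketch of the "mental model as extra context" pipeline described above.
# The checkpoint names ("allenai/DREAM", "allenai/macaw-large") and the slot-style
# prompt format are assumptions for illustration, not details taken from the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def generate(model_name: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Load a seq2seq checkpoint and generate a completion for the prompt."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

question = "Your friend left their wallet at your house. Is it okay to keep it?"

# Step 1: ask the DREAM-style elaborator for a mental model of the situation
# (e.g. likely motivations, consequences, and social norms).
mental_model = generate("allenai/DREAM", f"$answer$ ; $question$ = {question}")

# Step 2: prepend that elaboration as context when querying the zero-shot QA model.
answer = generate(
    "allenai/macaw-large",
    f"$answer$ ; $context$ = {mental_model} ; $question$ = {question}",
)
print(answer)
```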
Related papers
- Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.
We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard).
We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (~1K instances).
Extremely Hard questions present a fundamentally different challenge: they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z) - Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework.
SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions.
We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z) - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST^EM scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z) - What's "up" with vision-language models? Investigating their struggle with spatial reasoning [76.2406963762722]
Three new corpora quantify model comprehension of basic spatial relations.
We evaluate 18 vision-language (VL) models, finding that all perform poorly.
We conclude by studying causes of this surprising behavior.
arXiv Detail & Related papers (2023-10-30T17:50:15Z) - Do Large Language Models have Shared Weaknesses in Medical Question Answering? [1.25828876338076]
Large language models (LLMs) have made rapid improvement on medical benchmarks, but their unreliability remains a persistent challenge for safe real-world use.
We benchmark a range of top LLMs and identify consistent patterns across models.
We find evidence that models are similar in which questions they answer correctly, and that these patterns also resemble those of human test takers.
arXiv Detail & Related papers (2023-10-11T06:26:19Z) - Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games [14.063311955315077]
Large language models (LLMs) are effective at answering questions that are clearly asked.
When faced with ambiguous queries, they can act unpredictably and produce incorrect outputs.
This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively.
arXiv Detail & Related papers (2023-10-02T16:55:37Z) - Negated Complementary Commonsense using Large Language Models [3.42658286826597]
This work focuses on finding answers to negated complementary questions in commonsense scenarios.
We propose a model-agnostic methodology to improve the performance in negated complementary scenarios.
arXiv Detail & Related papers (2023-07-13T15:03:48Z) - Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference [72.61732440246954]
Large pre-trained language models often lack logical consistency across test inputs.
We propose a framework, ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models.
We show that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models.
arXiv Detail & Related papers (2022-11-21T21:58:30Z) - "John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility [19.47954905054217]
This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible.
We show that even state-of-the-art models such as GPT-3 struggle to answer the feasibility questions correctly.
arXiv Detail & Related papers (2022-10-14T02:46:06Z) - Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that generating lectures and explanations as a chain of thought (CoT) improves question answering performance by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from less data, achieving the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z) - What's the best place for an AI conference, Vancouver or ______: Why completing comparative questions is difficult [22.04829832439774]
We study the ability of neural LMs to ask (not answer) reasonable questions.
We show that accuracy in this fill-in-the-blank task is well-correlated with human judgements of whether a question is reasonable.
arXiv Detail & Related papers (2021-04-05T14:56:09Z) - How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z) - What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z)