Measuring and Narrowing the Compositionality Gap in Language Models
- URL: http://arxiv.org/abs/2210.03350v3
- Date: Tue, 17 Oct 2023 18:57:17 GMT
- Title: Measuring and Narrowing the Compositionality Gap in Language Models
- Authors: Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis
- Abstract summary: We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
- Score: 116.5228850227024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the ability of language models to perform compositional
reasoning tasks where the overall solution depends on correctly composing the
answers to sub-problems. We measure how often models can correctly answer all
sub-problems but not generate the overall solution, a ratio we call the
compositionality gap. We evaluate this ratio by asking multi-hop questions with
answers that require composing multiple facts unlikely to have been observed
together during pretraining. In the GPT-3 family of models, we show that as
model size increases, single-hop question answering performance improves
faster than multi-hop performance does; therefore, the compositionality gap
does not decrease. This surprising result suggests that while more powerful
models memorize and recall more factual knowledge, they show no corresponding
improvement in their ability to perform this kind of compositional reasoning.
We then demonstrate how elicitive prompting (such as chain of thought)
narrows the compositionality gap by reasoning explicitly. We present a new
method, self-ask, that further improves on chain of thought. In our method, the
model explicitly asks itself (and answers) follow-up questions before answering
the initial question. We finally show that self-ask's structured prompting lets
us easily plug in a search engine to answer the follow-up questions, which
additionally improves accuracy.
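For concreteness, the two ideas in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: `llm` and `search` are hypothetical placeholder callables standing in for any text-completion API and any search backend, the gap computation assumes per-question correctness flags have already been collected, and the scaffolding strings ("Follow up:", "Intermediate answer:", "So the final answer is:") follow the self-ask prompt format the abstract describes.

```python
from typing import Callable, Dict, List

# --- Compositionality gap -------------------------------------------------
# One reading of the ratio described in the abstract: among the multi-hop
# questions whose sub-questions were ALL answered correctly, what fraction
# of the multi-hop questions themselves were answered incorrectly?
def compositionality_gap(results: List[Dict[str, bool]]) -> float:
    """Each result has 'sub_answers_correct' and 'multi_hop_correct' flags."""
    composable = [r for r in results if r["sub_answers_correct"]]
    if not composable:
        return 0.0
    failed = sum(1 for r in composable if not r["multi_hop_correct"])
    return failed / len(composable)


# --- Self-ask with an optional search engine ------------------------------
FOLLOW = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"

def self_ask(question: str,
             llm: Callable[[str, str], str],   # (prompt, stop_sequence) -> completion
             search: Callable[[str], str],     # follow-up question -> short answer
             few_shot_prompt: str,
             max_hops: int = 4) -> str:
    """Let the model pose its own follow-up questions, answer each one with
    the search callable, and stop once the model states a final answer."""
    prompt = (few_shot_prompt
              + f"Question: {question}\n"
              + "Are follow up questions needed here: Yes.\n")
    for _ in range(max_hops):
        # Stop generation before the model answers its own follow-up,
        # so the search hook can supply the intermediate answer instead.
        continuation = llm(prompt, INTERMEDIATE)
        prompt += continuation
        if FINAL in continuation:
            return continuation.split(FINAL, 1)[1].strip()
        if FOLLOW in continuation:
            follow_up = continuation.split(FOLLOW, 1)[1].strip()
            prompt += f"\n{INTERMEDIATE} {search(follow_up)}\n"
        else:
            break
    # No final answer within the hop budget: ask for it directly.
    return llm(prompt + FINAL, "\n").strip()
```

In the search-augmented variant mentioned at the end of the abstract, the follow-up questions are routed to a search engine rather than answered by the model itself; in this sketch that is simply the `search` hook, which could equally be replaced by another call to `llm`.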
Related papers
- Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes in the model's "mind" during its step-by-step reasoning.
By analyzing patterns in these mind changes, we examine the correctness of the model's reasoning.
Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z)
- Question Decomposition Improves the Faithfulness of Model-Generated Reasoning [23.34325378824462]
It is difficult to verify the correctness and safety of the behavior of large language models (LLMs).
One approach is to prompt LLMs to externalize their reasoning, by having them generate step-by-step reasoning as they answer a question.
This approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case.
Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT.
arXiv Detail & Related papers (2023-07-17T00:54:10Z)
- RECKONING: Reasoning through Dynamic Knowledge Encoding [51.076603338764706]
We show that language models can answer questions by reasoning over knowledge provided as part of the context.
However, when that context also contains irrelevant facts, the model can fail to distinguish the knowledge that is necessary to answer the question.
We propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters.
arXiv Detail & Related papers (2023-05-10T17:54:51Z)
- Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [85.79940770146557]
We decompose multi-hop questions into multiple corresponding single-hop questions.
We find marked inconsistency in QA models' answers on these pairs of ostensibly identical question chains.
When trained only on single-hop questions, models generalize poorly to multi-hop questions.
arXiv Detail & Related papers (2022-10-09T11:48:07Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought (CoT) prompting improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Robustifying Multi-hop QA through Pseudo-Evidentiality Training [28.584236042324896]
We study the bias problem in multi-hop question answering models: answering correctly without correct reasoning.
We propose a new approach to learning evidentiality, i.e., deciding whether an answer prediction is supported by correct evidence.
arXiv Detail & Related papers (2021-07-07T14:15:14Z)
- Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model performs better (4.9% higher than the baseline) on the adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.