Measuring and Narrowing the Compositionality Gap in Language Models
- URL: http://arxiv.org/abs/2210.03350v3
- Date: Tue, 17 Oct 2023 18:57:17 GMT
- Title: Measuring and Narrowing the Compositionality Gap in Language Models
- Authors: Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis
- Abstract summary: We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
- Score: 116.5228850227024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the ability of language models to perform compositional
reasoning tasks where the overall solution depends on correctly composing the
answers to sub-problems. We measure how often models can correctly answer all
sub-problems but not generate the overall solution, a ratio we call the
compositionality gap. We evaluate this ratio by asking multi-hop questions with
answers that require composing multiple facts unlikely to have been observed
together during pretraining. In the GPT-3 family of models, we show that as
model size increases, single-hop question answering performance improves
faster than multi-hop performance does; therefore, the compositionality gap
does not decrease. This surprising result suggests that while more powerful
models memorize and recall more factual knowledge, they show no corresponding
improvement in their ability to perform this kind of compositional reasoning.
We then demonstrate how elicitive prompting (such as chain of thought)
narrows the compositionality gap by reasoning explicitly. We present a new
method, self-ask, that further improves on chain of thought. In our method, the
model explicitly asks itself (and answers) follow-up questions before answering
the initial question. We finally show that self-ask's structured prompting lets
us easily plug in a search engine to answer the follow-up questions, which
additionally improves accuracy.
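For concreteness, the two ideas in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: `llm` and `search` are hypothetical placeholder callables standing in for any text-completion API and any search backend, the gap computation assumes per-question correctness flags have already been collected, and the scaffolding strings ("Follow up:", "Intermediate answer:", "So the final answer is:") follow the self-ask prompt format the abstract describes.

```python
from typing import Callable, Dict, List

# --- Compositionality gap -------------------------------------------------
# One reading of the ratio described in the abstract: among the multi-hop
# questions whose sub-questions were ALL answered correctly, what fraction
# of the multi-hop questions themselves were answered incorrectly?
def compositionality_gap(results: List[Dict[str, bool]]) -> float:
    """Each result has 'sub_answers_correct' and 'multi_hop_correct' flags."""
    composable = [r for r in results if r["sub_answers_correct"]]
    if not composable:
        return 0.0
    failed = sum(1 for r in composable if not r["multi_hop_correct"])
    return failed / len(composable)


# --- Self-ask with an optional search engine ------------------------------
FOLLOW = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"

def self_ask(question: str,
             llm: Callable[[str, str], str],   # (prompt, stop_sequence) -> completion
             search: Callable[[str], str],     # follow-up question -> short answer
             few_shot_prompt: str,
             max_hops: int = 4) -> str:
    """Let the model pose its own follow-up questions, answer each one with
    the search callable, and stop once the model states a final answer."""
    prompt = (few_shot_prompt
              + f"Question: {question}\n"
              + "Are follow up questions needed here: Yes.\n")
    for _ in range(max_hops):
        # Stop generation before the model answers its own follow-up,
        # so the search hook can supply the intermediate answer instead.
        continuation = llm(prompt, INTERMEDIATE)
        prompt += continuation
        if FINAL in continuation:
            return continuation.split(FINAL, 1)[1].strip()
        if FOLLOW in continuation:
            follow_up = continuation.split(FOLLOW, 1)[1].strip()
            prompt += f"\n{INTERMEDIATE} {search(follow_up)}\n"
        else:
            break
    # No final answer within the hop budget: ask for it directly.
    return llm(prompt + FINAL, "\n").strip()
```

In the search-augmented variant mentioned at the end of the abstract, the follow-up questions are routed to a search engine rather than answered by the model itself; in this sketch that is simply the `search` hook, which could equally be replaced by another call to `llm`.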
Related papers
- Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes in the model's "mind" during its step-by-step reasoning.
By analyzing patterns in these mind changes, we examine the correctness of the model's reasoning.
Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z)
- Question Decomposition Improves the Faithfulness of Model-Generated Reasoning [23.34325378824462]
It is difficult to verify the correctness and safety of the behavior of large language models (LLMs).
One approach is to prompt LLMs to externalize their reasoning, by having them generate step-by-step reasoning as they answer a question.
This approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case.
Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT.
arXiv Detail & Related papers (2023-07-17T00:54:10Z)
- RECKONING: Reasoning through Dynamic Knowledge Encoding [51.076603338764706]
We show that language models can answer questions by reasoning over knowledge provided as part of the context.
However, when that context also contains irrelevant facts, the model can fail to distinguish the knowledge that is necessary to answer the question.
We propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters.
arXiv Detail & Related papers (2023-05-10T17:54:51Z)
- Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [85.79940770146557]
We decompose multi-hop questions into multiple corresponding single-hop questions.
We find marked inconsistency in QA models' answers on these pairs of ostensibly identical question chains.
When trained only on single-hop questions, models generalize poorly to multi-hop questions.
arXiv Detail & Related papers (2022-10-09T11:48:07Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought (CoT) prompting improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Robustifying Multi-hop QA through Pseudo-Evidentiality Training [28.584236042324896]
We study the bias problem in multi-hop question answering models: answering correctly without correct reasoning.
We propose a new approach to learning evidentiality, i.e., deciding whether an answer prediction is supported by correct evidence.
arXiv Detail & Related papers (2021-07-07T14:15:14Z)
- Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model performs better (4.9% higher than the baseline) on the adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.