Learn to Explain: Multimodal Reasoning via Thought Chains for Science
Question Answering
- URL: http://arxiv.org/abs/2209.09513v1
- Date: Tue, 20 Sep 2022 07:04:24 GMT
- Title: Learn to Explain: Multimodal Reasoning via Thought Chains for Science
Question Answering
- Authors: Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun
Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan
- Abstract summary: We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
- Score: 124.16250115608604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When answering a question, humans utilize the information available across
different modalities to synthesize a consistent and complete chain of thought
(CoT). This process is normally a black box in the case of deep learning models
like large-scale language models. Recently, science question benchmarks have
been used to diagnose the multi-hop reasoning ability and interpretability of
an AI system. However, existing datasets fail to provide annotations for the
answers, or are restricted to the textual-only modality, small scales, and
limited domain diversity. To this end, we present Science Question Answering
(SQA), a new benchmark that consists of ~21k multimodal multiple choice
questions with a diverse set of science topics and annotations of their answers
with corresponding lectures and explanations. We further design language models
to learn to generate lectures and explanations as the chain of thought (CoT) to
mimic the multi-hop reasoning process when answering SQA questions. SQA
demonstrates the utility of CoT in language models, as CoT improves the
question answering performance by 1.20% in few-shot GPT-3 and 3.99% in
fine-tuned UnifiedQA. We also explore the upper bound for models to leverage
explanations by feeding those in the input; we observe that it improves the
few-shot performance of GPT-3 by 18.96%. Our analysis further shows that
language models, similar to humans, benefit from explanations to learn from
fewer data and achieve the same performance with just 40% of the data.
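As a rough illustration of the prompting setup the abstract describes, here is a minimal sketch of a few-shot prompt in which each exemplar's answer is followed by a lecture and explanation as the chain of thought. The template, helper names, and demo data are assumptions for illustration, not the paper's exact format.

```python
# Minimal sketch: few-shot multiple-choice QA where the model is shown
# answer + lecture + explanation as the chain of thought (illustrative only).

def format_example(question, choices, answer=None, lecture=None, explanation=None):
    """Render one exemplar; leave answer/lecture/explanation empty for the test item."""
    options = " ".join(f"({chr(97 + i)}) {c}" for i, c in enumerate(choices))
    block = f"Question: {question}\nOptions: {options}\nAnswer:"
    if answer is not None:
        block += f" The answer is ({answer}). BECAUSE: {lecture} {explanation}"
    return block

def build_cot_prompt(train_examples, test_question, test_choices):
    shots = [format_example(**ex) for ex in train_examples]
    shots.append(format_example(test_question, test_choices))
    return "\n\n".join(shots)

demo = [{
    "question": "Which of these is a conductor?",
    "choices": ["copper wire", "rubber band"],
    "answer": "a",
    "lecture": "Conductors let electric charge flow through them easily.",
    "explanation": "Copper is a metal, and metals conduct electricity.",
}]

prompt = build_cot_prompt(demo, "Which layer of Earth is liquid?", ["outer core", "crust"])
print(prompt)  # send to a text-completion model and parse "(x)" from its output
```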
Related papers
- STOC-TOT: Stochastic Tree-of-Thought with Constrained Decoding for Complex Reasoning in Multi-Hop Question Answering [8.525847131940031]
Multi-hop question answering (MHQA) requires a model to retrieve and integrate information from multiple passages to answer a complex question.
Recent systems leverage the power of large language models and integrate evidence retrieval with reasoning prompts.
We propose STOC-TOT, a tree-of-thought reasoning prompting method with constrained decoding for MHQA.
arXiv Detail & Related papers (2024-07-04T07:17:53Z)
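A rough sketch of a stochastic tree-of-thought loop in the spirit of STOC-TOT: several candidate reasoning steps are sampled per expansion, and branches are kept in proportion to their scores. The propose/score stubs stand in for LLM calls, and the real method additionally constrains decoding so answers must come from retrieved evidence; everything here is illustrative.

```python
import random

def propose(state, k=2):
    # Stub: an LLM would propose k candidate next reasoning steps for a chain.
    return [state + [f"step{len(state)}-{i}"] for i in range(k)]

def score(state):
    # Stub: an LLM (or a verifier) would rate how promising a partial chain is.
    return random.random()

def stochastic_tot(question, depth=3, k=2, keep=2):
    frontier = [[question]]
    for _ in range(depth):
        children = [child for s in frontier for child in propose(s, k)]
        # Sample branches in proportion to their scores instead of greedy argmax.
        frontier = random.choices(children, weights=[score(c) for c in children], k=keep)
    return max(frontier, key=score)  # best complete chain; last step holds the answer

print(stochastic_tot("Who directed the film that won Best Picture in 1998?"))
```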
- Getting MoRE out of Mixture of Language Model Reasoning Experts [71.61176122960464]
We propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models.
We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning.
Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system's output.
arXiv Detail & Related papers (2023-05-24T02:00:51Z)
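A toy sketch of the mixture-of-reasoning-experts idea: the same backbone model is prompted once per reasoning category, and a selector picks which expert's answer to trust. The prompts and the majority-vote selector are assumptions; the paper trains an answer-selection model that can also abstain.

```python
# Illustrative expert prompts; the paper optimizes these per reasoning category.
EXPERT_PROMPTS = {
    "factual": "Answer with a short fact.\nQ: {q}\nA:",
    "multihop": "Break the question into hops, then answer.\nQ: {q}\nA:",
    "math": "Work through the arithmetic step by step.\nQ: {q}\nA:",
    "commonsense": "Use everyday knowledge.\nQ: {q}\nA:",
}

def more_answer(llm, question):
    answers = {name: llm(tpl.format(q=question)) for name, tpl in EXPERT_PROMPTS.items()}
    # Toy selector: majority vote across experts; the paper instead trains an
    # answer-selection model, whose choices are shown to annotators in the study.
    counts = {}
    for ans in answers.values():
        counts[ans] = counts.get(ans, 0) + 1
    return max(counts, key=counts.get), answers

fake_llm = lambda prompt: "42"  # stand-in for the prompted backbone model
print(more_answer(fake_llm, "What is 6 * 7?"))
```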
- T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering [59.63860993280275]
Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks.
We propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals.
Our approach achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%.
arXiv Detail & Related papers (2023-05-05T11:56:30Z)
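One plausible reading of "teaching with LLM signals", sketched under assumptions: a large teacher model generates chain-of-thought rationales that become fine-tuning targets for a smaller student. The prompt wording and record format below are invented for illustration.

```python
def make_teaching_example(teacher_llm, question, choices, gold):
    # Ask the teacher model for a chain-of-thought rationale for the gold answer.
    options = ", ".join(choices)
    rationale = teacher_llm(
        f"Question: {question}\nOptions: {options}\n"
        f"Explain step by step why the answer is {gold}:"
    )
    # The student model is then fine-tuned to emit the rationale plus the answer.
    return {"input": f"{question} Options: {options}",
            "target": f"{rationale} Therefore, the answer is {gold}."}

teacher = lambda prompt: "Metals conduct electricity, and copper is a metal."
print(make_teaching_example(teacher, "Which is a conductor?",
                            ["copper wire", "rubber band"], "copper wire"))
```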
- STREET: A Multi-Task Structured Reasoning and Explanation Benchmark [56.555662318619135]
We introduce a unified multi-task and multi-domain natural language reasoning and explanation benchmark.
We expect models to not only answer questions, but also produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer.
arXiv Detail & Related papers (2023-02-13T22:34:02Z)
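To make "step-by-step structured explanations" concrete, here is a minimal data shape in that spirit: each step cites the premises or earlier conclusions it uses and asserts a new one. The field names are illustrative, not STREET's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    uses: list        # ids of premises (or earlier steps) this step builds on
    conclusion: str   # the new intermediate or final conclusion

@dataclass
class StructuredExplanation:
    premises: dict                            # id -> premise text from the question
    steps: list = field(default_factory=list)

proof = StructuredExplanation(premises={
    "p1": "All metals conduct electricity.",
    "p2": "Copper is a metal.",
})
proof.steps.append(ReasoningStep(uses=["p1", "p2"],
                                 conclusion="Copper conducts electricity."))
print(proof)
```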
- Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [85.79940770146557]
We decompose multi-hop questions into multiple corresponding single-hop questions.
We find marked inconsistency in QA models' answers on these pairs of ostensibly identical question chains.
When trained only on single-hop questions, models generalize poorly to multi-hop questions.
arXiv Detail & Related papers (2022-10-09T11:48:07Z)
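A small sketch of the decomposition probe described above: answer the multi-hop question directly, then answer its single-hop parts in sequence and check agreement. The qa stub and the example chain are illustrative.

```python
def consistency_probe(qa, multihop_q, single_hops):
    """single_hops: ordered sub-questions; later ones contain a '{prev}' slot."""
    direct = qa(multihop_q)                  # answer the multi-hop question as-is
    prev = None
    for hop in single_hops:                  # answer the chain one hop at a time
        prev = qa(hop.format(prev=prev))
    return {"direct": direct, "chained": prev, "consistent": direct == prev}

qa = lambda q: "Lake Victoria"  # stand-in for a generative QA model
print(consistency_probe(
    qa,
    "What is the largest lake in the country where Kampala is located?",
    ["Which country is Kampala located in?",
     "What is the largest lake in {prev}?"],
))
```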
- Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization [20.14487209460865]
We investigate four translation methods that can translate natural questions into cloze-style sentences.
We show that our methods are complementary to a knowledge-base-improved model, and that combining them can lead to state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2022-01-01T07:12:49Z)
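A minimal sketch of the cloze-translation idea: rewrite a natural question as a fill-in-the-blank statement and let a language model score each answer candidate in the blank. The single rewrite rule and the scorer stub are assumptions; the paper studies four richer translation methods.

```python
def question_to_cloze(question):
    # Toy rule-based rewrite; the paper compares rule-based, syntactic, and
    # neural translation methods, all richer than this.
    q = question.rstrip("?")
    if q.startswith("What do ") and q.endswith(" do"):
        return q[len("What do "):-len(" do")] + " [MASK]."
    return q + " is [MASK]."

def best_candidate(lm_score, cloze, candidates):
    # lm_score(sentence) -> log-probability under a language model (stubbed below).
    return max(candidates, key=lambda c: lm_score(cloze.replace("[MASK]", c)))

cloze = question_to_cloze("What do doctors do?")             # "doctors [MASK]."
lm_score = lambda s: 0.9 if "treat patients" in s else 0.1   # stand-in LM scorer
print(cloze, "->", best_candidate(lm_score, cloze, ["treat patients", "fly planes"]))
```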
- Learning to Explain: Datasets and Models for Identifying Valid Reasoning Chains in Multihop Question-Answering [28.67167530758428]
We introduce three datasets in which explanations formed from corpus facts are annotated.
eQASC contains over 98K explanation annotations for the multihop question answering dataset QASC.
eQASC-perturbed is constructed by crowd-sourcing perturbations to test consistency and generalization of explanation prediction models.
eOBQA is constructed by adding explanation annotations to the OBQA dataset to test generalization of models trained on eQASC.
arXiv Detail & Related papers (2020-10-07T08:46:02Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
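A compact sketch of the Text Modular Networks loop: a next-question generator decomposes the complex question and routes each sub-question to either a neural single-span QA model or a symbolic calculator. The generator below is hand-written for one example (the paper learns it), and all names are placeholders.

```python
import re

def calculator(expr):
    # Tiny symbolic module: handles "diff(a, b)" requests only.
    m = re.match(r"diff\((\d+),\s*(\d+)\)", expr)
    return str(int(m.group(1)) - int(m.group(2))) if m else None

def modular_qa(next_question, span_qa, question):
    """next_question yields (module, sub_question) pairs until the task is solved."""
    state = {"question": question, "answers": []}
    for module, sub_q in next_question(state):
        ans = calculator(sub_q) if module == "calc" else span_qa(sub_q)
        state["answers"].append(ans)
    return state["answers"][-1]

# Hand-written decomposition for one question; the paper learns this generator.
def demo_next_question(state):
    yield ("qa", "In what year was Abraham Lincoln born?")
    yield ("qa", "In what year was Charles Darwin born?")
    yield ("calc", f"diff({state['answers'][1]}, {state['answers'][0]})")

span_qa = lambda q: "1809"  # stand-in single-span QA model (both were born in 1809)
print(modular_qa(demo_next_question, span_qa,
                 "How many years apart were Lincoln and Darwin born?"))  # -> 0
```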
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.