Getting MoRE out of Mixture of Language Model Reasoning Experts
- URL: http://arxiv.org/abs/2305.14628v2
- Date: Fri, 20 Oct 2023 05:16:29 GMT
- Title: Getting MoRE out of Mixture of Language Model Reasoning Experts
- Authors: Chenglei Si, Weijia Shi, Chen Zhao, Luke Zettlemoyer, Jordan Boyd-Graber
- Abstract summary: We propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models.
We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning.
Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system's output.
- Score: 71.61176122960464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent large language models (LLMs) improve on various question
answering (QA) datasets, it remains difficult for a single model to generalize
across question types that require distinct reasoning abilities. We provide
empirical evidence that state-of-the-art LLMs suffer from poor generalizability
on reasoning types beyond those seen in the prompt. To remedy this, we propose
a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse
specialized language models. We specialize the backbone language model with
prompts optimized for different reasoning categories, including factual,
multihop, mathematical, and commonsense reasoning. Our key insight is to
leverage agreement among the specialized experts to select the best answer for
each question, or to abstain from answering. This gives MoRE higher accuracy
than any single specialized model on a collection of 12 QA datasets from four
reasoning types. Beyond generalizability, the interpretable design of MoRE
improves selective question answering results compared to baselines without
incorporating inter-expert agreement. This framework is also more interpretable
and useful to human consumers of QA outputs. Our human study confirms that
presenting expert predictions and the answer selection process helps annotators
more accurately calibrate when to trust the system's output. We release all
code and data to facilitate future work.
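The abstract's key idea, using agreement among specialized experts to pick an answer or to abstain, can be illustrated with a minimal sketch. This is not the released MoRE code. The plain majority vote, the `normalize` helper, the `select_answer` function, and the `min_agreement` threshold are all assumptions made for illustration, and the paper's actual answer selector may work differently.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def select_answer(
    expert_answers: Dict[str, str], min_agreement: float = 0.5
) -> Tuple[Optional[str], float]:
    """Hypothetical selector: majority vote over expert answers.

    Returns (answer, agreement), or (None, agreement) to abstain when the
    share of experts backing the top answer is below min_agreement.
    """
    if not expert_answers:
        return None, 0.0
    votes = Counter(normalize(a) for a in expert_answers.values())
    answer, count = votes.most_common(1)[0]
    agreement = count / len(expert_answers)
    if agreement < min_agreement:
        return None, agreement  # abstain: experts disagree too much
    return answer, agreement


# Illustrative predictions from the four expert types named in the abstract.
predictions = {
    "factual": "Paris",
    "multihop": "paris",
    "mathematical": "42",
    "commonsense": "Paris",
}
print(select_answer(predictions))  # 3 of 4 experts agree
```

Raising `min_agreement` makes the sketch abstain more often. That is the trade-off selective question answering measures: fewer questions answered, but higher accuracy on the ones that are.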
Related papers
- STOC-TOT: Stochastic Tree-of-Thought with Constrained Decoding for Complex Reasoning in Multi-Hop Question Answering [8.525847131940031]
Multi-hop question answering (MHQA) requires a model to retrieve and integrate information from multiple passages to answer a complex question.
Recent systems leverage the power of large language models and integrate evidence retrieval with reasoning prompts.
We propose STOC-TOT, a tree-of-thought reasoning prompting method with constrained decoding for MHQA.
arXiv Detail & Related papers (2024-07-04T07:17:53Z)
- Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning).
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z)
- A Study on Large Language Models' Limitations in Multiple-Choice Question Answering [0.0]
We analyze 26 small open-source models and find that 65% of the models do not understand the task.
Only 4 models properly select an answer from the given choices, and only 5 of the 26 models are choice-order independent.
arXiv Detail & Related papers (2024-01-15T20:42:16Z)
- ExpertQA: Expert-Curated Questions and Attributed Answers [51.68314045809179]
We conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality.
We collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions.
The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
arXiv Detail & Related papers (2023-09-14T16:54:34Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought explanations improve question answering performance on SQA by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Mixture of Experts for Biomedical Question Answering [34.92691831878302]
We propose a Mixture-of-Expert (MoE) based question answering method called MoEBQA.
MoEBQA decouples the computation for different types of questions by sparse routing.
We evaluate MoEBQA on three Biomedical Question Answering (BQA) datasets constructed based on real examinations.
arXiv Detail & Related papers (2022-04-15T14:11:40Z)
- Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model performs better (4.9% higher than the baseline) on an adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)