Optimizing Language Model's Reasoning Abilities with Weak Supervision
- URL: http://arxiv.org/abs/2405.04086v1
- Date: Tue, 7 May 2024 07:39:15 GMT
- Title: Optimizing Language Model's Reasoning Abilities with Weak Supervision
- Authors: Yongqi Tong, Sizhe Wang, Dawei Li, Yifan Wang, Simeng Han, Zi Lin, Chengsong Huang, Jiaxin Huang, Jingbo Shang,
- Abstract summary: We present textscPuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supersized data to boost LLMs' inference capabilities.
- Score: 48.60598455782159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much of the past work has depended on extensively annotated datasets by human experts. However, this reliance on fully-supervised annotations poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model using a small collection of annotated questions. Then it iteratively improves LLMs by learning from the differences in responses from the SFT and unfinetuned models on unlabeled questions. Our approach provides an efficient approach without relying heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically only include golden-reference answers or rationales. Therefore, we present \textsc{PuzzleBen}, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales across various domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supersized data to boost LLMs' inference capabilities. Our experiments underscore the significance of \textsc{PuzzleBen}, as well as the effectiveness of our methodology as a promising direction in future endeavors. Our dataset and code will be published soon on \texttt{Anonymity Link}.
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z) - Crafting Interpretable Embeddings by Asking LLMs Questions [89.49960984640363]
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks.
We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli.
arXiv Detail & Related papers (2024-05-26T22:30:29Z) - Learning to Reduce: Optimal Representations of Structured Data in
Prompting Large Language Models [42.16047343029512]
Large Language Models (LLMs) have been widely used as general-purpose AI agents.
We propose a framework, Learning to Reduce, that fine-tunes a language model to generate a reduced version of an input context.
We show that our model achieves comparable accuracies in selecting the relevant evidence from an input context.
arXiv Detail & Related papers (2024-02-22T00:41:23Z) - Enhancing Textbook Question Answering Task with Large Language Models
and Retrieval Augmented Generation [3.948068081583197]
This paper proposes a methodology that handle the out-of-domain scenario in Textbook question answering (TQA)
Through supervised fine-tuning of the LLM model Llama-2 and the incorporation of RAG, our architecture outperforms the baseline, achieving a 4.12% accuracy improvement on validation set and 9.84% on test set for non-diagram multiple-choice questions.
arXiv Detail & Related papers (2024-02-05T11:58:56Z) - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Teaching Smaller Language Models To Generalise To Unseen Compositional
Questions [6.9076450524134145]
We propose a combination of multitask pretraining on up to 93 tasks designed to instill diverse reasoning abilities.
We show that performance can be significantly improved by adding retrieval-augmented training datasets.
arXiv Detail & Related papers (2023-08-02T05:00:12Z) - Enhancing In-Context Learning with Answer Feedback for Multi-Span
Question Answering [9.158919909909146]
In this paper, we propose a novel way of employing labeled data such as it informs LLM of some undesired output.
Experiments on three multi-span question answering datasets and a keyphrase extraction dataset show that our new prompting strategy consistently improves LLM's in-context learning performance.
arXiv Detail & Related papers (2023-06-07T15:20:24Z) - ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational
Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.