Large Language Models Can Self-Improve
- URL: http://arxiv.org/abs/2210.11610v1
- Date: Thu, 20 Oct 2022 21:53:54 GMT
- Title: Large Language Models Can Self-Improve
- Authors: Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang,
Hongkun Yu, Jiawei Han
- Abstract summary: We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions.
We show that our approach achieves state-of-the-art-level performance, without any ground truth label.
- Score: 34.78624270280148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved excellent performance in various
tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on
the other hand, may improve their reasoning abilities by self-thinking without
external inputs. In this work, we demonstrate that an LLM is also capable of
self-improving with only unlabeled datasets. We use a pre-trained LLM to
generate "high-confidence" rationale-augmented answers for unlabeled questions
using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM
using those self-generated solutions as target outputs. We show that our
approach improves the general reasoning ability of a 540B-parameter LLM
(74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and
63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance,
without any ground truth label. We conduct ablation studies and show that
fine-tuning on reasoning is critical for self-improvement.
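The selection step described in the abstract (sample several Chain-of-Thought answers per unlabeled question, keep the majority answer when the model agrees with itself often enough, and use the matching rationales as fine-tuning targets) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `threshold` value of 0.6, and the target string format are assumptions, and the sampling from the LLM is assumed to have already happened.

```python
from collections import Counter


def select_high_confidence(samples, threshold=0.6):
    """Majority-vote over sampled (rationale, answer) pairs for one question.

    Returns (majority_answer, confidence, kept_rationales), or None when the
    samples are empty or agreement falls below `threshold` (a hypothetical
    cutoff chosen for illustration).
    """
    if not samples:
        return None
    counts = Counter(answer for _, answer in samples)
    majority_answer, votes = counts.most_common(1)[0]
    confidence = votes / len(samples)
    if confidence < threshold:
        return None
    # Keep only the reasoning paths that reach the majority answer.
    kept = [rationale for rationale, answer in samples if answer == majority_answer]
    return majority_answer, confidence, kept


def build_finetuning_set(sampled_by_question, threshold=0.6):
    """Turn self-generated, self-consistent solutions into training examples."""
    examples = []
    for question, samples in sampled_by_question.items():
        selected = select_high_confidence(samples, threshold)
        if selected is None:
            continue  # low agreement: the question is dropped
        answer, _, rationales = selected
        for rationale in rationales:
            examples.append(
                {"input": question, "target": f"{rationale} The answer is {answer}."}
            )
    return examples
```

No ground-truth label appears anywhere in this sketch: agreement among the model's own samples stands in for correctness, which is why low-agreement questions are discarded rather than guessed at.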
Related papers
- Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization [12.885866125783618]
Large Language Models (LLMs) tend to produce inaccurate responses to specific queries.
We construct an adversarial dataset, named ADT (Adversarial Dataset for Tokenizer), to challenge LLMs' tokenization.
Our empirical results reveal that ADT is highly effective at challenging the tokenization of leading LLMs, including GPT-4o, Llama-3, and Qwen2.5-max.
arXiv Detail & Related papers (2024-05-27T11:39:59Z) - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvements of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z) - Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards [42.065997425172974]
Training on large amounts of rationales (i.e., CoT fine-tuning) is effective at improving the reasoning capabilities of large language models (LLMs).
We propose Self-Explore, where the LLM is tasked to explore the first wrong step within the rationale and use such signals as fine-grained rewards for further improvement.
On the GSM8K and MATH test sets, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT).
arXiv Detail & Related papers (2024-04-16T07:30:11Z) - Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define LLM's self-bias - the tendency to favor its own generation.
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z) - GRATH: Gradual Self-Truthifying for Large Language Models [63.502835648056305]
GRAdual self-truTHifying (GRATH) is a novel post-processing method to enhance the truthfulness of large language models (LLMs).
GRATH iteratively refines truthfulness data and updates the model, leading to a gradual improvement in model truthfulness in a self-supervised manner.
GRATH achieves state-of-the-art performance on TruthfulQA, with MC1 accuracy of 54.71% and MC2 accuracy of 69.10%, which even surpass those of 70B LLMs.
arXiv Detail & Related papers (2024-01-22T19:00:08Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models [65.86149763739141]
We introduce LogicAsker, an automatic approach that comprehensively evaluates and improves the logical reasoning abilities of LLMs.
We evaluate LogicAsker on six widely deployed LLMs, including GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco.
The results show that test cases from LogicAsker can find logical reasoning failures in different LLMs with a rate of 25% - 94%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.