Chain-of-Thought Hub: A Continuous Effort to Measure Large Language
Models' Reasoning Performance
- URL: http://arxiv.org/abs/2305.17306v1
- Date: Fri, 26 May 2023 23:46:42 GMT
- Authors: Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng and Tushar Khot
- Abstract summary: Chain-of-Thought Hub is an open-source evaluation suite on the multi-step reasoning capabilities of large language models.
- Score: 35.38549845444575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are continuously being developed, their
evaluation becomes increasingly important yet challenging. This work proposes
Chain-of-Thought Hub, an open-source evaluation suite on the multi-step
reasoning capabilities of large language models. We are interested in this
setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we
observe that complex reasoning is likely to be a key differentiator between
weaker and stronger LLMs; (2) we envisage large language models becoming the
next-generation computational platform and fostering an ecosystem of new LLM-based
applications, which naturally requires the foundation models to perform complex
tasks that often involve the composition of linguistic and logical operations.
Our approach is to compile a suite of challenging reasoning benchmarks to track
the progress of LLMs. Our current results show that: (1) model scale clearly
correlates with reasoning capabilities; (2) as of May 2023, Claude-v1.3 and
PaLM-2 are the only two models that are comparable with GPT-4, while
open-source models still lag behind; (3) LLaMA-65B performs close to
code-davinci-002, indicating that with successful further development such as
reinforcement learning from human feedback (RLHF), it has great potential to be
close to GPT-3.5-Turbo. Our results also suggest that for the open-source
efforts to catch up, the community may focus more on building better base
models and exploring RLHF.
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- CogBench: a large language model walks into a psychology lab [12.981407327149679]
This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments.
We apply CogBench to 35 large language models (LLMs) and analyze this data using statistical multilevel modeling techniques.
We find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior.
arXiv Detail & Related papers (2024-02-28T10:43:54Z)
- GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- GLoRE: Evaluating Logical Reasoning of Large Language Models [29.914546407784552]
We introduce GLoRE, a benchmark comprised of 12 datasets that span three different types of tasks.
ChatGPT and GPT-4 show a strong capability of logical reasoning, with GPT-4 surpassing ChatGPT by a large margin.
We propose a self-consistency probing method to enhance the accuracy of ChatGPT and a fine-tuned method to boost the performance of an open LLM.
arXiv Detail & Related papers (2023-10-13T13:52:15Z)
- Large Language Models Are Also Good Prototypical Commonsense Reasoners [11.108562540123387]
Traditional fine-tuning approaches can be resource-intensive and potentially compromise a model's generalization capacity.
We draw inspiration from the outputs of large models for tailored tasks and semi-automatically developed a set of novel prompts.
With better-designed prompts, we can achieve a new state of the art (SOTA) on the ProtoQA leaderboard.
arXiv Detail & Related papers (2023-09-22T20:07:24Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.