GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
- URL: http://arxiv.org/abs/2309.16583v6
- Date: Tue, 2 Apr 2024 01:51:38 GMT
- Title: GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
- Authors: Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang
- Abstract summary: GPT-Fathom is an open-source and reproducible evaluation suite for large language models (LLMs) built on top of OpenAI Evals.
We evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
- Score: 29.778018058541676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical details like whether adding code data improves LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, how much is the alignment tax, etc. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.
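As a rough illustration of what "aligned settings" means in practice, here is a minimal evaluation sketch. This is not GPT-Fathom's actual code: the model list, prompt template, and toy dataset are placeholders, with the OpenAI Python SDK used as an example backend. The key point is that every model is queried with the identical prompt and sampling parameters.

```python
# Minimal sketch of "aligned settings" evaluation (not GPT-Fathom's code):
# every model sees the same prompt template and the same sampling
# parameters, so accuracy numbers are directly comparable.
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-3.5-turbo", "gpt-4"]              # placeholder model list
PROMPT = "Question: {question}\nAnswer with a single letter (A-D)."
SETTINGS = {"temperature": 0, "max_tokens": 8}   # held fixed for all models

def evaluate(model: str, dataset: list[dict]) -> float:
    """Accuracy of `model` on a list of {"question", "answer"} items."""
    correct = 0
    for item in dataset:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(**item)}],
            **SETTINGS,
        )
        pred = resp.choices[0].message.content.strip()[:1]
        correct += int(pred == item["answer"])
    return correct / len(dataset)

if __name__ == "__main__":
    toy_set = [{"question": "2 + 2 = ? (A) 3 (B) 4 (C) 5", "answer": "B"}]
    for m in MODELS:
        print(m, evaluate(m, toy_set))
```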
Related papers
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
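The decompose-critique-refine loop described in the DeCRIM entry above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the prompts, model name, and `llm` helper are assumptions.

```python
# Illustrative decompose-critique-refine loop in the spirit of DeCRIM.
# The prompts and the `llm` helper are stand-ins, not the paper's code.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # any chat model would do here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def decrim_style_answer(instruction: str, max_rounds: int = 3) -> str:
    # Decompose: turn the instruction into an explicit constraint checklist.
    constraints = llm(
        "List every individual constraint in this instruction, "
        f"one per line:\n{instruction}"
    )
    answer = llm(instruction)
    for _ in range(max_rounds):
        # Critique: check the draft against each constraint.
        critique = llm(
            f"Constraints:\n{constraints}\n\nResponse:\n{answer}\n\n"
            "For each constraint, reply SATISFIED or VIOLATED with a reason. "
            "If all are satisfied, reply only ALL_OK."
        )
        if "ALL_OK" in critique:
            break
        # Refine: rewrite the draft using the critique as feedback.
        answer = llm(
            f"Instruction:\n{instruction}\n\nDraft:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRewrite the draft to fix every violation."
        )
    return answer
```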
- Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators [22.567933207841968]
Large Language Models (LLMs) and AI assistants are experiencing exponential growth in usage among both expert and amateur users.
In this work, we focus on evaluating the reliability of current LLMs as science communicators.
We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts.
arXiv Detail & Related papers (2024-09-21T06:48:32Z)
- See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
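A hedged sketch of the Self-Challenge generation loop from the entry above: seed failures are abstracted into error patterns, which are then instantiated into new test questions (the paper additionally filters these with human annotation). The prompts and the `llm` helper are illustrative assumptions.

```python
# Illustrative Self-Challenge loop: abstract seed failures into error
# patterns, then instantiate the patterns into new test questions.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def self_challenge(seed_failures: list[str], n_new: int = 5) -> list[str]:
    # Step 1: summarize what the failures have in common.
    patterns = llm(
        "These questions made a strong LLM fail:\n"
        + "\n".join(seed_failures)
        + "\nSummarize the underlying error patterns as a short list."
    )
    # Step 2: generate brand-new questions that exploit those patterns.
    generated = llm(
        f"Error patterns:\n{patterns}\n\n"
        f"Write {n_new} new questions, one per line, that exploit these patterns."
    )
    return [q.strip() for q in generated.splitlines() if q.strip()]
```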
- CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences [2.3749120526936465]
We propose using the LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding preferences.
CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs.
In turn, we explore the usage of CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO).
arXiv Detail & Related papers (2024-03-14T01:51:35Z)
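A minimal sketch of the LLM-as-a-Judge step behind preference data like CodeUltraFeedback: a judge model rates one candidate response against a stated coding preference, and ratings over several candidates yield chosen/rejected pairs for DPO-style tuning. The rubric, parsing, and model name are assumptions, not the dataset's pipeline.

```python
# Illustrative LLM-as-a-Judge scorer: rate one candidate response against
# a stated coding preference on a 1-5 scale. Ratings over several
# candidates yield chosen/rejected pairs for DPO-style tuning.
import re
from openai import OpenAI

client = OpenAI()

def judge(instruction: str, response: str, preference: str) -> int:
    prompt = (
        f"Preference: {preference}\n"
        f"Instruction: {instruction}\n"
        f"Response:\n{response}\n\n"
        "Rate how well the response follows the preference on a 1-5 scale. "
        "Answer with the digit only."
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r"[1-5]", out)
    return int(match.group()) if match else 1
```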
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z)
- Fine-Tuning and Prompt Engineering for Large Language Models-based Code Review Automation [4.941630596191807]
Fine-tuning and prompting are common approaches to leveraging Large Language Models (LLMs) for code review automation.
We apply model fine-tuning and inference techniques (i.e., zero-shot learning, few-shot learning, and persona prompting) to LLM-based code review automation.
Our results show that GPT-3.5 with zero-shot learning achieves 73.17%-74.23% higher Exact Match (EM) than Guo et al.'s approach.
arXiv Detail & Related papers (2024-02-01T03:10:26Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This highlights the promise of self-play: it enables LLMs to reach human-level performance without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
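A structural sketch of one SPIN-style self-play round from the entry above. Here `model.generate` and `train_step` are hypothetical helpers, not a real training API: the point is that the current model's own outputs act as the "rejected" side against human references from the SFT data.

```python
# Structural sketch of one SPIN-style self-play round. `model.generate`
# and `train_step` are hypothetical helpers, not a real training API.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # human-written reference from the SFT dataset
    rejected: str  # the current model's own output (the "opponent")

def spin_round(model, sft_data: list[dict], train_step):
    pairs = [
        PreferencePair(
            prompt=ex["prompt"],
            chosen=ex["response"],
            rejected=model.generate(ex["prompt"]),  # playing against itself
        )
        for ex in sft_data
    ]
    # One optimization pass that pushes the model toward `chosen` and away
    # from its own `rejected` outputs (a DPO-like objective in the paper).
    return train_step(model, pairs)
```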
- A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4 [4.206175795966694]
Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation.
We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs).
arXiv Detail & Related papers (2023-10-04T16:37:05Z)
- Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance [11.595274304409937]
Large language models (LLMs) have revolutionized zero-shot task performance.
Current methods using trigger phrases such as "Let's think step by step" remain limited.
This study introduces PRomPTed, an approach that optimizes zero-shot prompts for individual task instances.
arXiv Detail & Related papers (2023-10-03T14:51:34Z)
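A minimal sketch of instance-level prompt rewriting in the spirit of PRomPTed (the entry above): a second LLM call inspects one task instance together with the current zero-shot output and rewrites the prompt for that specific instance. The prompts and model name are illustrative.

```python
# Illustrative instance-level prompt rewriting: a meta step inspects one
# task instance plus the current zero-shot output and rewrites the prompt
# for that specific instance.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def solve_with_rewriting(instance: str) -> str:
    draft = llm(instance)  # plain zero-shot attempt
    better_prompt = llm(   # LLM-in-the-loop rewrite for THIS instance
        f"Task instance:\n{instance}\n\nA model answered:\n{draft}\n\n"
        "Rewrite the task prompt so a model is more likely to answer this "
        "specific instance correctly. Output only the new prompt."
    )
    return llm(better_prompt)
```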
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
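A small sketch of listwise LLM re-ranking in the spirit of the paper above: candidate passages are numbered in the prompt, the model returns a permutation, and the permutation is parsed back into a ranking. The prompt format and parsing are assumptions, not the paper's exact protocol.

```python
# Illustrative listwise re-ranking: number the candidate passages, ask the
# model for a permutation, then parse the permutation back into a ranking.
import re
from openai import OpenAI

client = OpenAI()

def rerank(query: str, passages: list[str]) -> list[str]:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Query: {query}\n\nPassages:\n{numbered}\n\n"
        "Rank the passages from most to least relevant. "
        "Answer with the numbers only, e.g. [2] > [1] > [3]."
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    order: list[int] = []
    for n in re.findall(r"\[(\d+)\]", out):
        i = int(n) - 1
        if 0 <= i < len(passages) and i not in order:
            order.append(i)
    ranked = [passages[i] for i in order]
    # Append anything the model forgot to mention, in original order.
    ranked += [p for p in passages if p not in ranked]
    return ranked
```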
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)