FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
- URL: http://arxiv.org/abs/2402.18667v1
- Date: Wed, 28 Feb 2024 19:23:27 GMT
- Title: FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
- Authors: Congying Xia, Chen Xing, Jiangshu Du, Xinyi Yang, Yihao Feng, Ran Xu,
Wenpeng Yin, Caiming Xiong
- Abstract summary: FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
- Score: 70.84333325049123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents FoFo, a pioneering benchmark for evaluating large
language models' (LLMs) ability to follow complex, domain-specific formats, a
crucial yet underexamined capability for their application as AI agents.
Despite LLMs' advancements, existing benchmarks fail to assess their
format-following proficiency adequately. FoFo fills this gap with a diverse
range of real-world formats and instructions, developed through an AI-Human
collaborative method. Our evaluation across both open-source (e.g., Llama 2,
WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three
key findings: open-source models significantly lag behind closed-source ones in
format adherence; LLMs' format-following performance is independent of their
content generation quality; and LLMs' format proficiency varies across
different domains. These insights suggest the need for specialized tuning for
format-following skills and highlight FoFo's role in guiding the selection of
domain-specific AI agents. FoFo is released at
https://github.com/SalesforceAIResearch/FoFo.
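The abstract describes scoring whether a model's output adheres to a prescribed, domain-specific format. FoFo's own evaluation protocol is not reproduced here; the sketch below is only a minimal illustration of what a rule-based format-adherence check can look like for one hypothetical instruction (reply as JSON with two required string fields). The schema, helper name, and example strings are assumptions for illustration, not part of the benchmark.

```python
# Illustrative sketch only: a rule-based check for one hypothetical format
# instruction ("reply as JSON with string fields 'diagnosis' and 'icd_code'").
# This is not FoFo's scoring code.
import json

REQUIRED_FIELDS = {"diagnosis": str, "icd_code": str}  # assumed example schema

def follows_format(response: str) -> bool:
    """Return True if the response is valid JSON with the required typed fields."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(
        key in payload and isinstance(payload[key], expected)
        for key, expected in REQUIRED_FIELDS.items()
    )

if __name__ == "__main__":
    good = '{"diagnosis": "influenza", "icd_code": "J11.1"}'
    bad = "The diagnosis is influenza (ICD J11.1)."
    print(follows_format(good))  # True
    print(follows_format(bad))   # False
```

A format-adherence score over a benchmark would then be the fraction of responses passing checks like this one, independent of whether the content itself is correct.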
Related papers
- Fine Tuning LLM for Enterprise: Practical Guidelines and Recommendations [2.699900017799093]
We focus on fine-tuning LLaMA, an open-source LLM, using proprietary documents and code from an enterprise repository.
As part of this work, we aim to guide beginners on how to start fine-tuning an LLM for documentation and code.
We also propose preprocessing recipes for both documentation and code to prepare datasets in different formats.
arXiv Detail & Related papers (2024-03-23T13:25:01Z)
- Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering [40.2758450304531]
Open-domain question answering (ODQA) has emerged as a pivotal research focus in information systems.
We propose a framework that formulates the ODQA process into three basic steps: query expansion, document selection, and answer generation.
We introduce a novel prompt optimization algorithm to refine role-playing prompts and steer LLMs to produce higher-quality evidence and answers.
arXiv Detail & Related papers (2024-03-08T11:09:13Z)
- PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs [49.32067576992511]
Large language models often fall short of the performance achieved by domain-specific state-of-the-art models.
One potential approach to enhance domain-specific capabilities of LLMs involves fine-tuning them using corresponding datasets.
We propose Preference Adaptation for Enhancing Domain-specific Abilities of LLMs (PANDA).
Our experimental results reveal that PANDA significantly enhances the domain-specific ability of LLMs on text classification and interactive decision tasks.
arXiv Detail & Related papers (2024-02-20T09:02:55Z)
- Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs).
We externalize the collective knowledge and unique strengths of the source models, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning [70.38817963253034]
This paper first discusses the challenges of federated fine-tuning of LLMs and introduces our package FS-LLM as the main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability (a rough sketch of such a score follows this entry).
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
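The summary above mentions the IFD metric only at a high level. As a hedged illustration, and not necessarily the paper's exact formulation, one way to compute such a score is the ratio of the answer's loss conditioned on the instruction to its loss without the instruction, so higher values flag samples where the instruction provides little help. The model name, prompt joining, and function names below are assumptions for illustration.

```python
# Hedged sketch of an IFD-style score: conditioned answer loss divided by
# unconditioned answer loss. "gpt2" is a placeholder model for the sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # any causal LM works for this illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def answer_loss(prompt: str, answer: str) -> float:
    """Mean cross-entropy on the answer tokens, with prompt tokens masked out.
    Note: the prompt/answer token boundary is approximate for BPE tokenizers."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on the prompt part
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss
    return loss.item()

def ifd_score(instruction: str, answer: str) -> float:
    """Higher score -> the instruction helps less, i.e. a 'harder' sample."""
    conditioned = answer_loss(instruction + "\n", answer)
    unconditioned = answer_loss("", answer)
    return conditioned / unconditioned

print(ifd_score("Summarize: The cat sat on the mat.", "A cat sat on a mat."))
```

Under this reading, samples with the higher scores would be the "cherry" samples kept for instruction tuning, as described in the summary above.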