FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
- URL: http://arxiv.org/abs/2402.18667v1
- Date: Wed, 28 Feb 2024 19:23:27 GMT
- Title: FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
- Authors: Congying Xia, Chen Xing, Jiangshu Du, Xinyi Yang, Yihao Feng, Ran Xu,
Wenpeng Yin, Caiming Xiong
- Abstract summary: FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
- Score: 70.84333325049123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents FoFo, a pioneering benchmark for evaluating large
language models' (LLMs) ability to follow complex, domain-specific formats, a
crucial yet underexamined capability for their application as AI agents.
Despite LLMs' advancements, existing benchmarks fail to assess their
format-following proficiency adequately. FoFo fills this gap with a diverse
range of real-world formats and instructions, developed through an AI-Human
collaborative method. Our evaluation across both open-source (e.g., Llama 2,
WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three
key findings: open-source models significantly lag behind closed-source ones in
format adherence; LLMs' format-following performance is independent of their
content generation quality; and LLMs' format proficiency varies across
different domains. These insights suggest the need for specialized tuning for
format-following skills and highlight FoFo's role in guiding the selection of
domain-specific AI agents. FoFo is released at
https://github.com/SalesforceAIResearch/FoFo.
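To make the "format following" capability concrete, here is a minimal, illustrative check of whether a model's answer obeys a requested output format. The `check_json_format` helper and the medical-report example are assumptions for illustration only, not FoFo's own evaluation protocol, which covers much richer, domain-specific formats.
```python
import json

def check_json_format(output: str, required_keys: list[str]) -> bool:
    """Return True if `output` parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)

# Hypothetical instruction: "Report the result as JSON with keys patient_id, diagnosis, icd_10."
model_output = '{"patient_id": "A-102", "diagnosis": "hypertension", "icd_10": "I10"}'
print(check_json_format(model_output, ["patient_id", "diagnosis", "icd_10"]))  # True
```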
Related papers
- FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware [4.480157114854711]
We present FVEval, the first comprehensive benchmark for characterizing large language models' (LLMs) performance in tasks pertaining to formal verification (FV).
The benchmark consists of three sub-tasks that measure LLM capabilities at different levels.
We present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with FV.
arXiv Detail & Related papers (2024-10-15T21:48:57Z)
- Open-domain Implicit Format Control for Large Language Model Generation [52.83173553689678]
We introduce a novel framework for controlled generation in large language models (LLMs).
This study investigates LLMs' ability to follow open-domain, one-shot constraints and replicate the format of example answers.
We also develop a dataset collection methodology for supervised fine-tuning that enhances the open-domain format control of LLMs without degrading output quality.
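As a toy illustration of the one-shot setting described above, the sketch below assembles a prompt that shows a model one example answer and asks it to reproduce that answer's format on a new question; the helper name and prompt wording are assumptions, not the paper's actual templates.
```python
def build_one_shot_format_prompt(example_q: str, example_a: str, new_q: str) -> str:
    """Assemble a one-shot prompt asking the model to mirror the example answer's format."""
    return (
        "Answer the new question using exactly the same format as the example answer.\n\n"
        f"Example question: {example_q}\n"
        f"Example answer:\n{example_a}\n\n"
        f"New question: {new_q}\n"
        "New answer:"
    )

prompt = build_one_shot_format_prompt(
    example_q="List two causes of inflation.",
    example_a="1. Demand-pull pressure\n2. Cost-push pressure",
    new_q="List two causes of deflation.",
)
print(prompt)
```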
arXiv Detail & Related papers (2024-08-08T11:51:45Z)
- Fine Tuning LLM for Enterprise: Practical Guidelines and Recommendations [2.699900017799093]
We focus on fine-tuning LLaMA, an open-source LLM, using proprietary documents and code from an enterprise repository.
As part of this work, we aim to guide beginners on how to start fine-tuning an LLM for documentation and code.
We also propose preprocessing recipes for both documentation and code to prepare datasets in different formats.
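A deliberately simple sketch of what such a preprocessing recipe can look like: chunking plain-text documentation into JSONL records that a fine-tuning pipeline can consume. The directory layout, blank-line chunking rule, and record fields are assumptions, not the recipes proposed in the paper.
```python
import json
from pathlib import Path

def docs_to_jsonl(doc_dir: str, out_path: str) -> None:
    """Convert plain-text documentation files into JSONL records for fine-tuning.
    Splitting on blank lines is a placeholder chunking heuristic."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(doc_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            for chunk in filter(None, (c.strip() for c in text.split("\n\n"))):
                record = {"source": path.name, "text": chunk}
                out.write(json.dumps(record) + "\n")

# docs_to_jsonl("enterprise_docs/", "train.jsonl")  # hypothetical paths
```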
arXiv Detail & Related papers (2024-03-23T13:25:01Z)
- PANDA: Preference Adaptation for Enhancing Domain-Specific Abilities of LLMs [49.32067576992511]
Large language models often fall short of the performance achieved by domain-specific state-of-the-art models.
One potential approach to enhance domain-specific capabilities of LLMs involves fine-tuning them using corresponding datasets.
We propose Preference Adaptation for Enhancing Domain-specific Abilities of LLMs (PANDA).
Our experimental results reveal that PANDA significantly enhances the domain-specific ability of LLMs on text classification and interactive decision tasks.
arXiv Detail & Related papers (2024-02-20T09:02:55Z)
- Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs).
We externalize the collective knowledge and unique strengths of the source models, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
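A minimal sketch of the kind of per-example objective such a self-play loop can optimize, assuming a DPO-style logistic loss in which the "opponent" response is generated by the previous iteration of the same model; the function and numbers below are illustrative, not SPIN's exact implementation.
```python
import math

def self_play_loss(
    logp_new_human: float, logp_old_human: float,
    logp_new_synth: float, logp_old_synth: float,
    beta: float = 0.1,
) -> float:
    """Logistic loss on a preference margin: the updated model should raise the
    likelihood of the human response and lower the likelihood of the response
    its own previous iteration generated, relative to that previous model."""
    margin = beta * ((logp_new_human - logp_old_human) - (logp_new_synth - logp_old_synth))
    return math.log(1.0 + math.exp(-margin))

# Margin > 0 (loss below log 2) once the new model prefers the human answer
# more strongly than its earlier self did.
print(self_play_loss(-12.0, -14.0, -16.0, -13.0))
```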
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
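A sketch of how such a difficulty score might be computed, assuming a ratio formulation: the model's per-token loss on the answer conditioned on the instruction, divided by its loss on the answer alone. The small stand-in model, example strings, and helper below are assumptions for illustration, not the paper's released code.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def ifd_score(model, tokenizer, instruction: str, answer: str) -> float:
    """Instruction-Following Difficulty as a loss ratio: values near or above 1.0
    suggest the instruction gives the model little help in producing the answer."""
    def answer_loss(prefix: str) -> float:
        answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
        if prefix:
            prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
            input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
            labels = input_ids.clone()
            labels[:, : prefix_ids.shape[1]] = -100  # score only the answer tokens
        else:
            input_ids = answer_ids
            labels = input_ids.clone()
        with torch.no_grad():
            # Token-boundary handling is simplified here for brevity.
            return model(input_ids=input_ids, labels=labels).loss.item()
    return answer_loss(instruction) / answer_loss("")

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(ifd_score(model, tokenizer, "Translate to French: Good morning.", " Bonjour, tout le monde."))
```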
arXiv Detail & Related papers (2023-08-23T09:45:29Z)