ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks
- URL: http://arxiv.org/abs/2412.09173v1
- Date: Thu, 12 Dec 2024 11:03:25 GMT
- Title: ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks
- Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Haoyu Wen, Wei Su, Boao Qian, Yuhang Guo,
- Abstract summary: We present FormatBench, a format-related benchmark for large language models (LLMs)
Experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiency in format faithfulness.
We propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality.
- Score: 32.021938679807555
- License:
- Abstract: Following formatting instructions to generate well-structured content is a fundamental yet often unmet capability for large language models (LLMs). To study this capability, which we refer to as format faithfulness, we present FormatBench, a comprehensive format-related benchmark. Compared to previous format-related benchmarks, FormatBench involves a greater variety of tasks in terms of application scenes (traditional NLP tasks, creative works, autonomous agency tasks), human-LLM interaction styles (single-turn instruction, multi-turn chat), and format types (inclusion, wrapping, length, coding). Moreover, each task in FormatBench is attached with a format checker program. Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiency in format faithfulness. By virtue of the decidable nature of formats, we propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality. Without any annotated data, ReFF can substantially improve the format faithfulness rate (e.g., from 21.6% in original LLaMA3 to 95.0% on caption segmentation task), while keep the general quality comparable (e.g., from 47.3 to 46.4 in F1 scores). Combined with labeled training data, ReFF can simultaneously improve both format faithfulness (e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from 47.3 to 61.6 in F1 scores). We further offer an interpretability analysis to explain how ReFF improves both format faithfulness and general quality.
Related papers
- Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction [2.2999148299770047]
We explore the capabilities of large language models for zero- and few-shot learning on the ASQP task.
We report F1 scores slightly below those obtained with state-of-the-art fine-tuned models but exceeding previously reported zero- and few-shot performance.
arXiv Detail & Related papers (2025-02-18T16:56:15Z) - Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance [5.8538128016098225]
This paper investigates the fragility of Large Language Models (LLMs) in generalizing to novel inputs.
Despite high benchmark scores, LLMs exhibit significant accuracy drops and unexpected biases when faced with minor but content-preserving modifications.
arXiv Detail & Related papers (2025-02-18T02:42:53Z) - LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs [69.40865293066885]
We present the first systematic evaluation examining format bias in performance of large language models (LLMs)
We present our empirical format bias evaluation spanning four commonly used categories -- multiple-choice question-answer, wrapping, list, and mapping.
arXiv Detail & Related papers (2024-08-16T10:45:45Z) - FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability [70.84333325049123]
FoFo is a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
This paper presents FoFo, a pioneering benchmark for evaluating large language models' (LLMs) ability to follow complex, domain-specific formats.
arXiv Detail & Related papers (2024-02-28T19:23:27Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies.
We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings.
We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and
Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - Transforming Sequence Tagging Into A Seq2Seq Task [10.130389627403433]
We study different formats one could use for casting input text sentences into the input and target of a Seq2Seq model.
We introduce a new format, which we show to not only be simpler but also more effective.
We find that the new format is more robust and almost completely devoid of hallucination.
arXiv Detail & Related papers (2022-03-16T03:48:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.