LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
- URL: http://arxiv.org/abs/2408.08656v1
- Date: Fri, 16 Aug 2024 10:45:45 GMT
- Title: LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
- Authors: Do Xuan Long, Hai Nguyen Ngoc, Tiviatis Sim, Hieu Dao, Shafiq Joty, Kenji Kawaguchi, Nancy F. Chen, Min-Yen Kan
- Abstract summary: We present the first systematic evaluation examining format bias in the performance of large language models (LLMs).
We present our empirical format bias evaluation spanning four commonly used categories -- multiple-choice question answering, wrapping, list, and mapping.
- Score: 69.40865293066885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first systematic evaluation examining format bias in the performance of large language models (LLMs). Our approach distinguishes between two categories of evaluation metrics under format constraints to assess performance reliably and accurately: one measures performance only when format constraints are adhered to, while the other evaluates performance regardless of constraint adherence. We then define a metric for measuring the format bias of LLMs and establish effective strategies to reduce it. We then present our empirical format bias evaluation spanning four commonly used categories -- multiple-choice question answering, wrapping, list, and mapping -- covering 15 widely used formats. Our evaluation on eight generation tasks uncovers significant format bias across state-of-the-art LLMs. We further find that improving the format-instruction-following capabilities of LLMs across formats can reduce format bias. Based on these findings, we study two mitigation techniques: prompting and fine-tuning on synthesized format data. Our methods reduce the variance in ChatGPT's performance across wrapping formats from 235.33 to 0.71 squared percentage points (%$^2$).
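To make the two metric categories and the bias measure concrete, here is a minimal Python sketch, not the paper's code: one score counts only outputs that adhered to the requested format, the other counts all outputs, and bias is taken as the variance of per-format scores in %$^2$, the unit the abstract reports. The function names and the choice of population variance are illustrative assumptions.

```python
from statistics import pvariance

def adherence_metrics(results):
    """results: (adhered, correct) boolean pairs for one output format.
    Returns two scores: accuracy among format-adherent outputs only,
    and accuracy over all outputs regardless of adherence."""
    adhered = [correct for a, correct in results if a]
    acc_adherent = sum(adhered) / len(adhered) if adhered else 0.0
    acc_overall = sum(correct for _, correct in results) / len(results)
    return acc_adherent, acc_overall

def format_bias(per_format_scores):
    """Variance of per-format scores in squared percentage points
    (%^2); population variance is an assumed choice of statistic."""
    return pvariance([100 * s for s in per_format_scores])

# Toy numbers: accuracy under three hypothetical wrapping formats.
print(f"{format_bias([0.62, 0.48, 0.71]):.2f} %^2")
```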
Related papers
- Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon [11.753349115726952]
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues.
We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that distorts benchmark prompts.
By rephrasing inputs while preserving semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns.
arXiv Detail & Related papers (2025-02-11T10:43:36Z)
- Verifiable Format Control for Large Language Model Generations [24.789801375314664]
Large Language Models (LLMs) have demonstrated satisfactory general instruction-following ability.
Small LLMs with about 7B parameters still struggle with fine-grained format following (e.g., verifiable formats).
arXiv Detail & Related papers (2025-02-06T20:57:36Z)
- ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks [32.021938679807555]
We present FormatBench, a format-related benchmark for large language models (LLMs).
Experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiency in format faithfulness.
We propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality.
arXiv Detail & Related papers (2024-12-12T11:03:25Z)
- From Lists to Emojis: How Format Bias Affects Model Alignment [67.08430328350327]
We study format biases in reinforcement learning from human feedback.
Many widely-used preference models, including human evaluators, exhibit strong biases towards specific format patterns.
We show that with a small amount of biased data, we can inject significant bias into the reward model.
arXiv Detail & Related papers (2024-09-18T05:13:18Z)
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models [59.970391602080205]
This study investigates whether such constraints on the generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension.
We evaluate LLMs' performance when they are restricted to adhere to structured formats versus generating free-form responses across various common tasks.
We find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
arXiv Detail & Related papers (2024-08-05T13:08:24Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify errors spanning 68 translation accuracy phenomena.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies.
We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings.
We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z)
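As a rough illustration of that last idea, the sketch below exhaustively scores a small grid of prompt-format variants and reports the min/max accuracy spread. It is a toy stand-in under assumed separator and casing choices, not the paper's weight-free interval estimator, which avoids exhaustive evaluation.

```python
import itertools

# Hypothetical formatting choices: field separators and casing are the
# kind of subtle variations such sensitivity studies perturb.
SEPARATORS = [": ", " - ", ":\n"]
CASINGS = [str.title, str.upper, str.lower]

def accuracy(prompt_fn, dataset, model):
    """Score one prompt format on (question, answer) pairs; `model` is
    any str -> str callable standing in for a real LLM."""
    hits = sum(model(prompt_fn(q)).strip() == a for q, a in dataset)
    return hits / len(dataset)

def performance_interval(dataset, model):
    """Evaluate every format in the grid and report the min/max
    accuracy spread; a brute-force stand-in for the paper's
    expected-performance interval."""
    scores = []
    for sep, case in itertools.product(SEPARATORS, CASINGS):
        fmt = lambda q, sep=sep, case=case: f"{case('question')}{sep}{q}"
        scores.append(accuracy(fmt, dataset, model))
    return min(scores), max(scores)
```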
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.