Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format
- URL: http://arxiv.org/abs/2506.23133v1
- Date: Sun, 29 Jun 2025 08:11:52 GMT
- Title: Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format
- Authors: Dingzirui Wang, Xuanliang Zhang, Rongyu Cao, Longxu Dou, Xianzhen Luo, Yingwei Ma, Qingfu Zhu, Wanxiang Che, Binhua Li, Fei Huang, Yongbin Li
- Abstract summary: Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. We adapt suitable formats to the given tasks by generating and selecting formats. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works.
- Score: 93.99600697438079
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Generating multiple answers and voting on them is an effective way to mitigate the reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which may not suit every task and incur high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose a way to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which uses LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating its effectiveness.
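The loop described in the abstract (generate candidate reasoning formats, sample multiple answers under each, and keep the format that minimizes a reasoning-error measure) can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the helper names `llm_generate` and `propose_formats` are hypothetical, and the disagreement-based error proxy merely stands in for the error measurement the paper actually defines.

```python
# Minimal sketch of a Format-Adapter-style loop: propose formats, sample
# answers per format, and keep the format with the lowest estimated error.
# All helpers are hypothetical; the error proxy below (answer disagreement)
# is an assumption, not the paper's actual error measurement.
from collections import Counter


def llm_generate(prompt: str, n: int = 1) -> list[str]:
    """Placeholder for an LLM sampling call; wire up a real client here."""
    raise NotImplementedError


def propose_formats(task: str, k: int = 4) -> list[str]:
    """Ask the LLM to propose k candidate reasoning formats for the task."""
    prompt = (f"Propose {k} distinct reasoning formats (e.g. chain-of-thought, "
              f"program-of-thought, tabular steps) suited to this task:\n{task}")
    return llm_generate(prompt, n=k)


def disagreement(answers: list[str]) -> float:
    """Error proxy: 1 minus the vote share of the majority answer."""
    top = Counter(answers).most_common(1)[0][1]
    return 1.0 - top / len(answers)


def select_format(task: str, question: str, samples: int = 8) -> str:
    """Keep the format whose sampled answers disagree the least."""
    errors = {}
    for fmt in propose_formats(task):
        answers = llm_generate(f"Answer in this format: {fmt}\n{question}",
                               n=samples)
        errors[fmt] = disagreement(answers)
    return min(errors, key=errors.get)
```

Under this sketch, the final answer for the selected format would still be chosen by majority vote over its samples, matching the generate-and-vote setup the abstract starts from.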
Related papers
- ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks [32.021938679807555]
We present FormatBench, a format-related benchmark for large language models (LLMs). Experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiencies in format faithfulness. We propose Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality.
arXiv Detail & Related papers (2024-12-12T11:03:25Z)
- Does Prompt Formatting Have Any Impact on LLM Performance? [10.869929764785464]
This paper examines the impact of different prompt templates on the performance of Large Language Models (LLMs).
We evaluated their impact across tasks like natural language reasoning, code generation, and translation using OpenAI's GPT models.
Experiments show that GPT-3.5-turbo's performance varies by up to 40% in a code translation task depending on the prompt template, while larger models like GPT-4 are more robust to these variations.
arXiv Detail & Related papers (2024-11-15T19:26:38Z)
- From Lists to Emojis: How Format Bias Affects Model Alignment [67.08430328350327]
We study format biases in reinforcement learning from human feedback. Many widely used preference models, including human evaluators, exhibit strong biases towards specific format patterns. We show that with a small amount of biased data, we can inject significant bias into the reward model.
arXiv Detail & Related papers (2024-09-18T05:13:18Z)
- LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs [69.40865293066885]
We present the first systematic evaluation examining format bias in the performance of large language models (LLMs). Our empirical format bias evaluation spans four commonly used categories: multiple-choice question-answer, wrapping, list, and mapping.
arXiv Detail & Related papers (2024-08-16T10:45:45Z)
- Handling Numeric Expressions in Automatic Speech Recognition [56.972851337263755]
We compare cascaded and end-to-end approaches to recognize and format numeric expressions. Results show that adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
arXiv Detail & Related papers (2024-07-18T09:46:19Z)
- Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting [68.19544657508509]
Large language models (LLMs) are adopted as a fundamental component of language technologies.
We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt format in few-shot settings.
We propose an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights.
arXiv Detail & Related papers (2023-10-17T15:03:30Z)
- Forward-Backward Reasoning in Large Language Models for Mathematical Verification [65.9495774606273]
Self-Consistency samples diverse reasoning chains with answers and chooses the final answer by majority voting.
We introduce backward reasoning to verify candidate answers.
FOBAR (FOrward and BAckward Reasoning for verification) achieves state-of-the-art performance; a minimal sketch of this vote-then-verify idea appears after this list.
arXiv Detail & Related papers (2023-08-15T13:19:59Z)
- Transforming Sequence Tagging Into A Seq2Seq Task [10.130389627403433]
We study different formats one could use for casting input text sentences into the input and target of a Seq2Seq model.
We introduce a new format, which we show to be not only simpler but also more effective.
We find that the new format is more robust and almost completely devoid of hallucination.
arXiv Detail & Related papers (2022-03-16T03:48:14Z)
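For the Forward-Backward Reasoning entry above, the vote-then-verify idea can also be sketched briefly. The following is a speculative illustration of combining a forward vote fraction with a backward check (mask a number in the question, assume a candidate answer, and see whether the model recovers the masked value); the prompt wording, the exact-match recovery test, and the equal weighting are assumptions, not that paper's exact formulation.

```python
# Hedged sketch of forward voting plus a backward consistency check.
# Prompt wording, the exact-match recovery test, and the 0.5/0.5 weighting
# are illustrative assumptions, not the cited paper's formulation.
from collections import Counter


def llm_generate(prompt: str, n: int = 1) -> list[str]:
    """Placeholder for an LLM sampling call; wire up a real client here."""
    raise NotImplementedError


def forward_scores(question: str, samples: int = 8) -> dict[str, float]:
    """Forward pass: vote fraction of each candidate answer across samples."""
    answers = llm_generate(f"Solve step by step:\n{question}", n=samples)
    return {a: c / samples for a, c in Counter(answers).items()}


def backward_score(question: str, masked: str, candidate: str,
                   samples: int = 4) -> float:
    """Backward pass: mask a known number, assume the candidate answer,
    and measure how often the model recovers the masked value."""
    prompt = (f"{question.replace(masked, 'x')}\n"
              f"If the answer is {candidate}, what is the value of x?")
    recovered = llm_generate(prompt, n=samples)
    return sum(r.strip() == masked for r in recovered) / samples


def verify(question: str, masked: str, alpha: float = 0.5) -> str:
    """Combine forward vote share and backward recovery rate per answer."""
    scores = {a: alpha * p + (1 - alpha) * backward_score(question, masked, a)
              for a, p in forward_scores(question).items()}
    return max(scores, key=scores.get)
```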