Divide-Verify-Refine: Can LLMs Self-Align with Complex Instructions?
- URL: http://arxiv.org/abs/2410.12207v2
- Date: Thu, 27 Feb 2025 22:16:18 GMT
- Title: Divide-Verify-Refine: Can LLMs Self-Align with Complex Instructions?
- Authors: Xianren Zhang, Xianfeng Tang, Hui Liu, Zongyu Wu, Qi He, Dongwon Lee, Suhang Wang
- Abstract summary: We propose a framework to divide complex instructions into single constraints and prepare appropriate tools. We then verify responses using tools that provide rigorous checks and textual guidance. To maximize refinement effectiveness, we propose dynamic few-shot prompting, where a refinement repository collects successful refinements.
- Score: 33.18076221854853
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent studies show LLMs struggle with complex instructions involving multiple constraints (e.g., length, format, sentiment). Existing works address this issue by fine-tuning, which relies heavily on fine-tuning data quality and is computationally expensive. An alternative is leveraging LLMs' self-correction to refine responses for better constraint adherence. However, this is limited by feedback quality, as LLMs cannot generate reliable feedback or detect errors. Moreover, its effectiveness relies on few-shot examples illustrating response modifications. As constraints in complex instructions are diverse, manually crafting such examples for each constraint type can be labor-intensive and sub-optimal. To address these two challenges, we propose the Divide-Verify-Refine (DVR) framework with three steps: (1) Divide: split complex instructions into single constraints and prepare appropriate tools; (2) Verify: check responses using tools that provide rigorous checks and textual guidance (e.g., a Python toolkit for format checks or pre-trained classifiers for content analysis); (3) Refine: to maximize refinement effectiveness, we propose dynamic few-shot prompting, where a refinement repository collects successful refinements, and these examples are selectively retrieved for future refinements. Recognizing the lack of complexity in existing datasets, we create a new dataset of complex instructions. DVR doubles Llama3.1-8B's constraint adherence and triples Mistral-7B's performance.
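The three steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Constraint`, `RefinementRepository`, `max_words`, `ends_with`, `dvr`, and `toy_refine` are hypothetical, the constraints are written by hand rather than extracted by an LLM, and `toy_refine` is a rule-based stand-in for the LLM refinement call. It only shows how tool-based verification, textual feedback, and a repository of successful refinements might fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A verifier returns None when the constraint holds, otherwise textual guidance.
Verifier = Callable[[str], Optional[str]]


@dataclass
class Constraint:
    kind: str          # hypothetical label, e.g. "length" or "format"
    verify: Verifier   # the "tool" prepared for this constraint


@dataclass
class RefinementRepository:
    """Stores successful refinements, grouped by constraint kind (assumed design)."""
    examples: dict = field(default_factory=dict)

    def add(self, kind, before, feedback, after):
        self.examples.setdefault(kind, []).append((before, feedback, after))

    def retrieve(self, kind, k=2):
        # Return the k most recent successes to use as few-shot examples.
        return self.examples.get(kind, [])[-k:]


def max_words(limit):
    def check(text):
        n = len(text.split())
        return None if n <= limit else f"Response has {n} words; use at most {limit}."
    return check


def ends_with(suffix):
    def check(text):
        ok = text.rstrip().endswith(suffix)
        return None if ok else f"Response must end with {suffix!r}."
    return check


def dvr(response, constraints, refine, repo, max_rounds=3):
    """Verify each constraint, then refine on the first failure, up to max_rounds."""
    for _ in range(max_rounds):
        failures = []
        for c in constraints:
            feedback = c.verify(response)
            if feedback is not None:
                failures.append((c, feedback))
        if not failures:
            break
        constraint, feedback = failures[0]
        shots = repo.retrieve(constraint.kind)
        candidate = refine(response, feedback, shots)
        # Only refinements that pass the tool check are stored for reuse.
        if constraint.verify(candidate) is None:
            repo.add(constraint.kind, response, feedback, candidate)
        response = candidate
    return response


def toy_refine(response, feedback, shots):
    """Rule-based stand-in for an LLM call; it ignores the few-shot examples."""
    if "at most" in feedback:
        limit = int(feedback.rstrip(".").split()[-1])
        return " ".join(response.split()[:limit])
    if "must end with" in feedback:
        return response.rstrip() + "."
    return response


repo = RefinementRepository()
constraints = [Constraint("length", max_words(5)), Constraint("format", ends_with("."))]
result = dvr("alpha beta gamma delta epsilon zeta eta", constraints, toy_refine, repo)
print(result)
```

In a real pipeline, `refine` would build a prompt from the response, the tool feedback, and the retrieved examples, and the retrieval step would likely be more selective than "most recent by kind"; both are simplifications made here.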
Related papers
- Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation [6.920352059545929]
We present Less is More, the third-place winning approach in the XLLM@ACL2025 Shared Task-III.
Our approach focuses on structured reasoning from only 24 labeled examples.
All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup.
arXiv Detail & Related papers (2025-04-23T04:19:52Z) - Constraint Back-translation Improves Complex Instruction Following of Large Language Models [55.60192044049083]
Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc.
Previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs.
We propose a novel data generation technique, constraint back-translation.
arXiv Detail & Related papers (2024-10-31T17:42:26Z) - LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z) - Control Large Language Models via Divide and Conquer [94.48784966256463]
This paper investigates controllable generation for large language models (LLMs) with prompt-based control, focusing on Lexically Constrained Generation (LCG).
We evaluate the performance of LLMs on satisfying lexical constraints with prompt-based control, as well as their efficacy in downstream applications.
arXiv Detail & Related papers (2024-10-06T21:20:06Z) - Prompt Recursive Search: A Living Framework with Adaptive Growth in LLM Auto-Prompting [22.025533583703126]
We propose a novel Prompt Recursive Search (PRS) framework for large language models (LLMs).
The PRS framework incorporates an assessment of problem complexity and an adjustable structure, reducing the likelihood of errors.
Compared to the Chain of Thought (CoT) method, the PRS method increased accuracy on the BBH dataset by 8% using the Llama3-7B model, achieving a 22% improvement.
arXiv Detail & Related papers (2024-08-02T17:59:42Z) - Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the ability of complex instruction-following of large language models (LLMs) has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z) - From Distributional to Overton Pluralism: Investigating Large Language Model Alignment [82.99849359892112]
We re-examine previously reported reductions in response diversity post-alignment.
Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation.
Findings indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior.
arXiv Detail & Related papers (2024-06-25T16:32:33Z) - CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks [15.60762281287532]
Large Language Models (LLMs) are revolutionizing various domains, yet verifying their answers remains a significant challenge.
In this work, we propose CheckEmbed: an accurate, scalable, and simple LLM verification approach.
CheckEmbed is driven by a straightforward yet powerful idea: compare their corresponding answer-level embeddings obtained with a model such as GPT Text Embedding Large.
arXiv Detail & Related papers (2024-06-04T17:42:21Z) - Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA).
We propose a novel adaptive QA framework, that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z) - Benchmarking Large Language Models on Controllable Generation under Diversified Instructions [34.89012022437519]
Large language models (LLMs) have exhibited impressive instruction-following capabilities.
It is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions.
We propose a new benchmark CoDI-Eval to evaluate LLMs' responses to instructions with various constraints.
arXiv Detail & Related papers (2024-01-01T07:35:31Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for large language models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought [56.558892336235914]
Reversing Chain-of-Thought (RCoT) is a novel method to improve large language models' reasoning abilities.
RCoT automatically detects and rectifies factual inconsistency in generated solutions.
We show that manually written fine-grained feedback can dramatically improve LLMs' reasoning abilities.
arXiv Detail & Related papers (2023-05-19T08:02:52Z) - Aligning Instruction Tasks Unlocks Large Language Models as Zero-Shot Relation Extractors [11.28397947587596]
Fine-tuning large language models (LLMs) on large-scale instruction-following datasets substantially improves their performance on a wide range of NLP tasks.
However, even advanced instruction-tuned LLMs still fail to outperform small LMs on relation extraction (RE).
We propose QA4RE, a framework that aligns RE with question answering (QA), a predominant task in instruction-tuning datasets.
arXiv Detail & Related papers (2023-05-18T17:48:03Z) - Successive Prompting for Decomposing Complex Questions [50.00659445976735]
Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting.
We introduce "Successive Prompting", where we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution.
Our best model (with successive prompting) achieves an improvement of 5% absolute F1 on a few-shot version of the DROP dataset.
arXiv Detail & Related papers (2022-12-08T06:03:38Z)