GIER: Gap-Driven Self-Refinement for Large Language Models
- URL: http://arxiv.org/abs/2509.00325v1
- Date: Sat, 30 Aug 2025 02:54:08 GMT
- Title: GIER: Gap-Driven Self-Refinement for Large Language Models
- Authors: Rinku Dewri
- Abstract summary: GIER (Gap-driven Iterative Enhancement of Responses) is a framework for improving large language model (LLM) outputs through self-reflection and revision.
GIER improves rationale quality, grounding, and reasoning alignment without degrading task accuracy.
Our analysis demonstrates that models can not only interpret abstract conceptual gaps but also translate them into concrete reasoning improvements.
- Score: 0.8460698440162889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce GIER (Gap-driven Iterative Enhancement of Responses), a general framework for improving large language model (LLM) outputs through self-reflection and revision based on conceptual quality criteria. Unlike prompting strategies that rely on demonstrations, examples, or chain-of-thought templates, GIER utilizes natural language descriptions of reasoning gaps, and prompts a model to iteratively critique and refine its own outputs to better satisfy these criteria. Across three reasoning-intensive tasks (SciFact, PrivacyQA, and e-SNLI) and four LLMs (GPT-4.1, GPT-4o Mini, Gemini 1.5 Pro, and Llama 3.3 70B), GIER improves rationale quality, grounding, and reasoning alignment without degrading task accuracy. Our analysis demonstrates that models can not only interpret abstract conceptual gaps but also translate them into concrete reasoning improvements.
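The critique-and-revise loop described in the abstract can be sketched as below. This is a hypothetical illustration of the general idea, not the authors' implementation; `call_llm` is a toy stand-in for a real chat-completion API, and the prompt wording and the "NO GAPS" stopping convention are assumptions made for the sketch.

```python
def call_llm(prompt: str) -> str:
    """Toy stand-in for a real chat API: flags a gap in the first draft,
    then approves the revision. Replace with an actual model call."""
    if prompt.startswith("Critique"):
        # Report a gap only for the unrevised draft.
        return ("Gap: the claim is not grounded in the source."
                if prompt.rstrip().endswith("draft answer") else "NO GAPS")
    if prompt.startswith("Revise"):
        return "grounded, revised answer"
    return "draft answer"

def gier_refine(task_input: str, gap_criteria: list[str], max_iters: int = 3) -> str:
    """Iteratively critique a draft against natural-language descriptions of
    reasoning gaps and revise it until the critique reports no remaining gaps."""
    response = call_llm(f"Answer the task:\n{task_input}")
    for _ in range(max_iters):
        critique = call_llm(
            "Critique the response against these reasoning-gap criteria:\n"
            + "\n".join(f"- {c}" for c in gap_criteria)
            + f"\n\nResponse:\n{response}"
        )
        if "NO GAPS" in critique:
            break
        response = call_llm(
            f"Revise the response to address this critique:\n{critique}"
            f"\n\nResponse:\n{response}"
        )
    return response
```

With the stub above, the loop flags the first draft, revises it once, and stops when the critique reports no remaining gaps; in practice the gap criteria would be the task-specific quality descriptions the paper evaluates.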
Related papers
- OneRec-Think: In-Text Reasoning for Generative Recommendation [55.53292983432484]
OneRec-Think is a unified framework that seamlessly integrates dialogue, reasoning, and personalized recommendation.
Our proposed "Think-Ahead" architecture enables effective industrial deployment on Kuaishou, achieving a 0.159% gain in APP Stay Time.
arXiv Detail & Related papers (2025-10-13T17:20:13Z)
- Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models [5.584522240405349]
This study explores automated item generation (AIG) using language models to create multiple choice questions (MCQs) for morphological assessment.
We evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations.
Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs.
arXiv Detail & Related papers (2025-08-27T18:54:32Z)
- ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models [76.28894983518164]
Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs).
They often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers.
We introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains.
arXiv Detail & Related papers (2025-08-17T14:50:23Z) - ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs [54.154593699263074]
ProtoReasoning is a framework that enhances the reasoning ability of Large Reasoning Models.
ProtoReasoning transforms problems into corresponding prototype representations.
ProtoReasoning achieves a 4.7% improvement over baseline models on logical reasoning.
arXiv Detail & Related papers (2025-06-18T07:44:09Z) - Self-Critique and Refinement for Faithful Natural Language Explanations [17.8004479689826]
We introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE).
This framework enables models to improve the faithfulness of their own explanations.
We show that SR-NLE significantly reduces unfaithfulness rates.
arXiv Detail & Related papers (2025-05-28T20:08:42Z) - A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs).
We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness.
Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
arXiv Detail & Related papers (2024-12-12T16:04:31Z) - Learning to Refine with Fine-Grained Natural Language Feedback [81.70313509881315]
We propose looking at refinement with feedback as a composition of three distinct LLM competencies.
A key property of the proposed Detect, Critique, Refine ("DCR") method is that the step 2 critique model can give fine-grained feedback about errors.
We show that models of different capabilities benefit from refining with DCR on the task of improving factual consistency of document grounded summaries.
arXiv Detail & Related papers (2024-07-02T16:15:01Z) - Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification [41.330719056639616]
We study the entailment verification problem of multi-sentence premises.
Modern NLP problems, such as detecting inconsistent model-generated rationales, require complex multi-hop reasoning.
arXiv Detail & Related papers (2024-02-06T04:14:09Z) - Self-Discover: Large Language Models Self-Compose Reasoning Structures [136.48389510481758]
We introduce SELF-DISCOVER, a framework for self-discovering task-intrinsic reasoning structures.
SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks.
We show that the self-discovered reasoning structures are universally applicable across model families.
arXiv Detail & Related papers (2024-02-06T01:13:53Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - GLoRE: Evaluating Logical Reasoning of Large Language Models [20.77694584450457]
We introduce GLoRE, a platform that consolidates diverse datasets and standardizes them into a unified format for evaluating large language models.
Our experimental results show that, compared to the performance of humans and supervised fine-tuning models, the logical reasoning capabilities of large reasoning models, such as OpenAI's o1 mini, DeepSeek R1, and QwQ-32B, have seen remarkable improvements.
arXiv Detail & Related papers (2023-10-13T13:52:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.