Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation
- URL: http://arxiv.org/abs/2504.16408v1
- Date: Wed, 23 Apr 2025 04:19:52 GMT
- Title: Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation
- Authors: Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di
- Abstract summary: We present Less is More, the third-place winning approach in the XLLM@ACL2025 Shared Task-III. Our approach focuses on structured reasoning from only 24 labeled examples. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup.
- Score: 6.920352059545929
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The XLLM@ACL2025 Shared Task-III formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in the XLLM@ACL2025 Shared Task-III, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structured reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at https://github.com/Jiahao-Yuan/Less-is-More.
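To make the filtering step concrete, here is a minimal Python sketch of dual-stage reward-guided filtering, assuming a structural validity check and a scalar reward model; the function names, the top-k value, and the two-stage ordering shown are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of dual-stage reward-guided filtering (not the authors' code).
# Stage 1 keeps only candidates that pass a structural validity check;
# stage 2 ranks the survivors with a scalar reward and keeps the top-k.
from typing import Callable, List, Tuple

def filter_candidates(
    candidates: List[str],
    is_valid_structure: Callable[[str], bool],  # e.g., parses into (statement, evidence) pairs
    reward_fn: Callable[[str], float],          # hypothetical reward-model score
    top_k: int = 8,
) -> List[str]:
    # Stage 1: structure validation.
    structurally_valid = [c for c in candidates if is_valid_structure(c)]
    # Stage 2: reward-guided ranking and selection.
    scored: List[Tuple[float, str]] = sorted(
        ((reward_fn(c), c) for c in structurally_valid), reverse=True
    )
    return [c for _, c in scored[:top_k]]
```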
Related papers
- U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack [9.760456105567078]
This paper introduces U-NIAH, a unified framework that systematically compares Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Our framework incorporates multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings. Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness.
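As a rough illustration of how one needle-in-a-haystack case is built, the sketch below inserts a needle sentence at a chosen depth in filler text; the needle wording, depth, and question are assumptions, not U-NIAH's exact protocol.

```python
# Hedged sketch of a single needle-in-a-haystack test case; multi-needle
# configurations would repeat the insertion at several depths.
def build_niah_case(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at relative position `depth` (0.0-1.0) within `haystack`."""
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

filler = "The sky was clear and the market was quiet that morning. " * 2000
needle = "The secret passphrase is 'violet meridian'."
context = build_niah_case(filler, needle, depth=0.5)
prompt = context + "\n\nWhat is the secret passphrase?"
```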
arXiv Detail & Related papers (2025-03-01T05:05:24Z)
- Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs [32.45604456988931]
This study establishes baseline comparisons for Automated Fact-Checking (AFC) using Large Language Models (LLMs). We evaluate Llama-3 models of varying sizes on 17,856 claims collected from PolitiFact (2007-2024) using evidence retrieved via restricted web searches. Our results show that larger LLMs consistently outperform smaller LLMs in classification accuracy and justification quality without fine-tuning.
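A hedged sketch of what an evidence-conditioned verdict prompt might look like in such a setup; the template and the PolitiFact-style label set are illustrative, not the paper's exact formulation.

```python
# Illustrative verdict prompt for evidence-conditioned claim classification;
# the label set mirrors PolitiFact's scale but is an assumption here.
LABELS = ["true", "mostly-true", "half-true", "mostly-false", "false", "pants-on-fire"]

def verdict_prompt(claim: str, evidence: list[str]) -> str:
    snippets = "\n".join(f"- {s}" for s in evidence)
    return (
        f"Claim: {claim}\n"
        f"Evidence:\n{snippets}\n"
        f"Classify the claim as one of: {', '.join(LABELS)}. "
        f"Then justify the verdict in one paragraph."
    )
```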
arXiv Detail & Related papers (2025-02-13T02:51:17Z)
- Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs [29.735465300269993]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often struggle with spatial reasoning. This paper presents a novel neural-symbolic framework that enhances LLMs' spatial reasoning abilities through iterative feedback between LLMs and Answer Set Programming (ASP). We evaluate our approach on two benchmark datasets: StepGame and SparQA.
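The ASP half of such a loop can be sketched with clingo's Python API as below; the spatial facts and rule are illustrative, and in the full pipeline an LLM would emit the fact block and receive solver errors back as feedback.

```python
# Hedged sketch of the solver side of an LLM <-> ASP feedback loop.
import clingo  # pip install clingo

def solve_asp(program: str) -> list[str]:
    """Ground and solve an ASP program, returning the shown atoms of one model."""
    ctl = clingo.Control()
    ctl.add("base", [], program)
    ctl.ground([("base", [])])
    atoms: list[str] = []
    ctl.solve(on_model=lambda m: atoms.extend(str(a) for a in m.symbols(shown=True)))
    return atoms

# Illustrative facts an LLM might extract from "the box is right of the ball":
program = """
right(box, ball).
left(A, B) :- right(B, A).
#show left/2.
"""
print(solve_asp(program))  # ['left(ball,box)']
# If solving fails or contradicts the question, the error text is fed back to
# the LLM and the facts are regenerated: the iterative feedback step above.
```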
arXiv Detail & Related papers (2024-11-27T18:04:05Z)
- Divide-Verify-Refine: Can LLMs Self-Align with Complex Instructions? [33.18076221854853]
We propose a framework to divide complex instructions into single constraints and prepare appropriate tools. We then verify responses using tools that provide rigorous checks and textual guidance. To maximize refinement effectiveness, we propose dynamic few-shot prompting, where a refinement repository collects successful refinements.
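A minimal sketch of the divide-and-verify idea, assuming two toy constraint checkers; the checker names, prompts, and sample draft are hypothetical.

```python
# Hedged sketch: each atomic constraint gets a small checker tool, and
# failed checks become textual guidance for the refinement step.
def check_word_limit(text: str, limit: int) -> str | None:
    n = len(text.split())
    return None if n <= limit else f"Response has {n} words; the limit is {limit}."

def check_keyword(text: str, keyword: str) -> str | None:
    return None if keyword.lower() in text.lower() else f"Missing required keyword: {keyword!r}."

def verify(text: str, checks) -> list[str]:
    """Run every checker and collect guidance from the ones that fail."""
    return [msg for check in checks if (msg := check(text)) is not None]

draft = "Our plan fits the schedule and runs under one hundred words."
feedback = verify(draft, [lambda t: check_word_limit(t, 100),
                          lambda t: check_keyword(t, "budget")])
# Non-empty feedback would be sent back to the LLM as refinement guidance.
```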
arXiv Detail & Related papers (2024-10-16T04:01:55Z)
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
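A hedged sketch of a decompose-critique-refine loop in this spirit, where `llm` is a hypothetical prompt-to-text callable and every prompt shown is illustrative rather than DeCRIM's actual template.

```python
# Illustrative decompose-critique-refine loop (not DeCRIM's released code).
def decompose_critique_refine(llm, instruction: str, max_rounds: int = 3) -> str:
    # Decompose: list the atomic constraints in the instruction.
    constraints = llm(f"List each atomic constraint in:\n{instruction}").splitlines()
    response = llm(instruction)
    for _ in range(max_rounds):
        # Critique: check the response against each constraint separately.
        failed = [c for c in constraints
                  if llm(f"Does this satisfy '{c}'? Answer yes or no.\n{response}")
                     .strip().lower().startswith("no")]
        if not failed:
            return response  # all constraints satisfied
        # Refine: revise the response using the failed constraints as feedback.
        response = llm(f"Revise to satisfy these constraints: {failed}\n\n{response}")
    return response
```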
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
- MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning [60.55556283848063]
Large Language Model (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among them.
Refinement offers an alternative by using LLM-generated feedback to improve solution quality.
We propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard.
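A minimal sketch of difficulty-gated refinement in the spirit of this summary, assuming a sampler and a refinement routine; the vote-agreement threshold used to label problems easy or hard is an arbitrary choice, not MAgICoRe's criterion.

```python
# Hedged sketch: strong vote agreement marks a problem "easy" (stop at
# aggregation); weak agreement marks it "hard" (route to refinement).
from collections import Counter

def aggregate_or_refine(sample, refine, problem: str, k: int = 8, tau: float = 0.6) -> str:
    answers = [sample(problem) for _ in range(k)]     # k independent samples
    top, votes = Counter(answers).most_common(1)[0]   # majority answer
    if votes / k >= tau:
        return top                                    # easy: voting suffices
    return refine(problem, answers)                   # hard: iterative refinement
```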
arXiv Detail & Related papers (2024-09-18T17:12:41Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the number of agent interaction steps.
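The single-agent advantage-weighted regression objective that the summary refers to has the standard form below; the multi-agent extension with per-agent advantages is the paper's contribution and is not reproduced here.

```latex
% Standard single-agent advantage-weighted regression (AWR) objective;
% ReAd extends this form to multi-agent plan refinement.
J(\pi) \;=\; \mathbb{E}_{(s,a) \sim \mathcal{D}}
  \Big[ \log \pi(a \mid s) \, \exp\!\Big( \tfrac{1}{\beta} A(s,a) \Big) \Big]
```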
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models [84.15513004135576]
Current research enhances the reasoning performance of Large Language Models (LLMs) by sampling multiple reasoning chains and ensembling based on the answer frequency.
This approach fails in scenarios where the correct answers are in the minority.
We introduce a hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains.
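A hedged sketch contrasting frequency-based voting with evaluation-based selection; `score_chain` stands in for a hypothetical reasoning-quality evaluator and is not AoR's actual scoring procedure.

```python
# Illustrative answer selection: group chains by final answer, score each
# group's reasoning, and pick the best-supported answer.
from collections import defaultdict

def select_answer(chains: list[tuple[str, str]], score_chain) -> str:
    """chains: (reasoning, answer) pairs; returns the best-supported answer."""
    groups: dict[str, list[str]] = defaultdict(list)
    for reasoning, answer in chains:
        groups[answer].append(reasoning)
    # Majority voting would be: max(groups, key=lambda a: len(groups[a])).
    # Evaluation-based selection can instead pick a minority answer whose
    # reasoning chains score higher:
    return max(groups, key=lambda a: max(score_chain(r) for r in groups[a]))
```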
arXiv Detail & Related papers (2024-05-21T17:12:19Z)
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs).
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
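For readers unfamiliar with the pattern, a minimal CoT prompt looks like the sketch below; CoTGenius is about automatically generating higher-quality variants of such prompts, and this hand-written example is illustrative only.

```python
# A minimal one-shot chain-of-thought prompt: the worked example demonstrates
# step-by-step reasoning before posing the new question.
prompt = (
    "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
    "A: Let's think step by step. Speed = distance / time = 60 / 1.5 = 40 km/h.\n"
    "Q: A car travels 150 km in 2.5 hours. What is its average speed?\n"
    "A: Let's think step by step."
)
```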
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
- Leveraging Large Language Models for Structure Learning in Prompted Weak Supervision [24.866270447991752]
We show that our Structure Refining Module improves the PromptedWS pipeline by up to 12.7 points on the benchmark tasks.
We also explore the trade-offs between efficiency and performance with comprehensive ablation experiments and analysis.
arXiv Detail & Related papers (2024-02-02T19:45:39Z)
- Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [51.33182775762785]
This paper presents an empirical study to build relation extraction systems in low-resource settings.
We investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; and (iii) data augmentation technologies and self-training to generate more labeled in-domain data.
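Scheme (iii) can be sketched as a standard confidence-thresholded self-training loop; the `fit`/`predict` interface is a hypothetical trainer API assumed for illustration, not the paper's code.

```python
# Hedged sketch of self-training: pseudo-label unlabeled examples with the
# current model and keep only high-confidence predictions for retraining.
def self_train(model, labeled, unlabeled, rounds: int = 3, threshold: float = 0.9):
    data, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model.fit(data)  # hypothetical trainer API
        still_unlabeled = []
        for x in pool:
            label, confidence = model.predict(x)  # hypothetical inference API
            if confidence >= threshold:
                data.append((x, label))  # promote confident pseudo-labels
            else:
                still_unlabeled.append(x)
        pool = still_unlabeled
    return model
```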
arXiv Detail & Related papers (2022-10-19T15:46:37Z)