InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
- URL: http://arxiv.org/abs/2506.18102v1
- Date: Sun, 22 Jun 2025 17:14:29 GMT
- Title: InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
- Authors: Fuyu Wang, Jiangtong Li, Kun Zhu, Changjun Jiang
- Abstract summary: Existing large language models (LLMs) focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. We propose a dual-component framework: $\textbf{InspireScore}$, a novel evaluation system, and $\textbf{InspireDebate}$, an optimized debating framework. $\textbf{InspireScore}$ achieves 44% higher correlation with expert judgments compared to existing methods, while $\textbf{InspireDebate}$ shows significant improvements.
- Score: 15.096294311783836
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions (including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement), thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) $\textbf{InspireScore}$, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) $\textbf{InspireDebate}$, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that $\textbf{InspireScore}$ achieves 44% higher correlation with expert judgments compared to existing methods, while $\textbf{InspireDebate}$ shows significant improvements, outperforming baseline models by 57%. Source code is available at https://github.com/fywang12/InspireDebate.
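The abstract's evaluation design (four subjective criteria plus two objective metrics) can be illustrated with a minimal scoring sketch. This is a hypothetical illustration only: the dimension names follow the abstract, but the grouping, weighting scheme, and function names are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-dimensional debate scorer in the spirit of
# InspireScore. Dimension names come from the abstract; the equal-weight
# averaging and the `w_subj` mixing parameter are assumptions for illustration.

SUBJECTIVE = ["emotional_appeal", "argument_clarity",
              "argument_arrangement", "topic_relevance"]
OBJECTIVE = ["fact_authenticity", "logical_validity"]


def inspire_style_score(scores: dict, w_subj: float = 0.5) -> float:
    """Combine per-dimension scores (each in [0, 1]) into a single value.

    Subjective and objective groups are averaged separately, then mixed
    with weight `w_subj` on the subjective side.
    """
    subj = sum(scores[d] for d in SUBJECTIVE) / len(SUBJECTIVE)
    obj = sum(scores[d] for d in OBJECTIVE) / len(OBJECTIVE)
    return w_subj * subj + (1.0 - w_subj) * obj


example = {
    "emotional_appeal": 0.6, "argument_clarity": 0.8,
    "argument_arrangement": 0.7, "topic_relevance": 0.9,
    "fact_authenticity": 0.5, "logical_validity": 0.7,
}
print(round(inspire_style_score(example), 3))  # 0.675 with the default mix
```

Separating the two groups before mixing keeps the objective metrics (fact authenticity, logical validity) from being diluted by the larger number of subjective criteria.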
Related papers
- Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks [2.3188831772813105]
We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions.
arXiv Detail & Related papers (2025-07-23T17:58:14Z) - Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement [43.532921045069365]
Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks. Current techniques, such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection, struggle to yield substantial and lasting performance gains. We present a novel Debate and Reflect (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback.
arXiv Detail & Related papers (2025-06-04T03:52:20Z) - Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time [52.230936493691985]
We propose SITAlign, an inference framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach.
arXiv Detail & Related papers (2025-05-29T17:56:05Z) - Adaptive Thinking via Mode Policy Optimization for Social Language Agents [75.3092060637826]
We propose a framework to improve the adaptive thinking ability of language agents in dynamic social interactions. Our framework advances existing research in three key aspects: (1) multi-granular thinking mode design, (2) context-aware mode switching across social interactions, and (3) token-efficient reasoning via depth-adaptive processing.
arXiv Detail & Related papers (2025-05-04T15:39:58Z) - Understanding Bias Reinforcement in LLM Agents Debate [28.36216398327389]
Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning. Self-correction methods such as self-consistency and self-refinement aim to improve reliability. We identify two key limitations: bias reinforcement and lack of perspective diversity.
arXiv Detail & Related papers (2025-03-21T02:51:30Z) - Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
This paper approaches the problem of $\textit{autoformulation}$: the automated creation of solver-ready optimization models from natural language problem descriptions. We identify core challenges of autoformulation, including (1) the vast, problem-dependent hypothesis space and (2) efficient and diverse exploration of this space under uncertainty. We present a novel method leveraging $\textit{Large Language Models}$ with $\textit{Monte-Carlo Tree Search}$, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations.
arXiv Detail & Related papers (2024-11-03T20:41:38Z) - Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought [61.588465852846646]
Chain-of-Thought (CoT) reasoning has emerged as a promising approach for enhancing the performance of large language models (LLMs).
In this work, we introduce a novel reasoning boundary framework (RBF) to address these challenges.
arXiv Detail & Related papers (2024-10-08T05:26:28Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Evaluating the Performance of Large Language Models via Debates [43.40134389150456]
Large Language Models (LLMs) are rapidly evolving and impacting various fields. Most current approaches to performance evaluation are either based on fixed, domain-specific questions or rely on human input. We propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge but also skills such as argumentative reasoning and inconsistency recognition.
arXiv Detail & Related papers (2024-06-16T19:02:31Z) - Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM [51.43102092480804]
Debatrix is an automated debate judge based on Large Language Models (LLMs).
To align with real-world debate scenarios, we introduced the PanelBench benchmark, comparing our system's performance to actual debate outcomes.
The findings indicate a notable enhancement over directly using LLMs for debate evaluation.
arXiv Detail & Related papers (2024-03-12T18:19:47Z) - Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime [59.27851754647913]
Predictive optimization precisely models many real-world applications, including energy cost-aware scheduling and budget allocation for advertising.
We develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for advertising.
Our study shows that PnO approaches outperform PtO on 7 out of 8 benchmarks, but no silver bullet is found for the specific design choices of PnO.
arXiv Detail & Related papers (2023-11-13T13:19:34Z) - $\{\text{PF}\}^2\text{ES}$: Parallel Feasible Pareto Frontier Entropy Search for Multi-Objective Bayesian Optimization Under Unknown Constraints [4.672142224503371]
We present a novel information-theoretic acquisition function for multi-objective Bayesian optimization.
$\{\text{PF}\}^2\text{ES}$ provides a low-cost and accurate estimate of the mutual information for the parallel setting.
We benchmark $\{\text{PF}\}^2\text{ES}$ across synthetic and real-life problems.
arXiv Detail & Related papers (2022-04-11T21:06:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.