IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
- URL: http://arxiv.org/abs/2511.01014v1
- Date: Sun, 02 Nov 2025 17:06:49 GMT
- Title: IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
- Authors: Bosi Wen, Yilin Niu, Cunxiang Wang, Pei Ke, Xiaoying Ling, Ying Zhang, Aohan Zeng, Hongning Wang, Minlie Huang
- Abstract summary: We propose IF-CRITIC, an evaluation model for instruction following in Large Language Models. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization.
- Score: 87.38454788767545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction following still suffer from deficiencies such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic that can provide efficient and reliable assessments of constraint following in instructions. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments demonstrate that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including DeepSeek-R1 and o4-mini. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization with lower computational overhead than strong LLM critic baselines.
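To make the described pipeline concrete, here is a minimal sketch of how a checklist-driven critic of this kind can be wired together. The prompts, function names, and YES/NO answer protocol are illustrative assumptions, not IF-CRITIC's actual implementation; `llm` stands for any text-in/text-out model endpoint.

```python
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model endpoint (assumption)

@dataclass
class ConstraintVerdict:
    constraint: str
    satisfied: bool
    critique: str

def generate_checklist(llm: LLM, instruction: str) -> List[str]:
    """Decompose an instruction into atomic constraints (the 'checklist generator' role)."""
    prompt = ("List every distinct constraint the following instruction imposes, "
              "one per line:\n" + instruction)
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

def critique_response(llm: LLM, instruction: str, response: str,
                      checklist: List[str]) -> List[ConstraintVerdict]:
    """Judge the response against each constraint independently."""
    verdicts = []
    for constraint in checklist:
        answer = llm(f"Instruction: {instruction}\nResponse: {response}\n"
                     f"Constraint: {constraint}\n"
                     "Does the response satisfy this constraint? "
                     "Answer YES or NO, then explain briefly.")
        verdicts.append(ConstraintVerdict(constraint,
                                          answer.strip().upper().startswith("YES"),
                                          answer))
    return verdicts

def checklist_reward(verdicts: List[ConstraintVerdict]) -> float:
    """Fraction of satisfied constraints, usable as a scalar reward signal."""
    return sum(v.satisfied for v in verdicts) / max(len(verdicts), 1)
```

The fraction-of-constraints-satisfied aggregation is just one natural way to turn per-constraint verdicts into a scalar reward; the abstract's constraint-level preference optimization instead trains the critic by comparing critiques at the level of individual constraints.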
Related papers
- Rectifying LLM Thought from Lens of Optimization [48.98086817378953]
Long chain-of-thought (CoT) prompting enables thorough exploration and deliberation. Despite advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors. We introduce RePro, a novel approach to refine LLM reasoning during post-training.
arXiv Detail & Related papers (2025-12-01T17:41:08Z)
- Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning [34.43632129774481]
In this paper, we quantify and investigate the potential reason: an imbalanced evaluation preference. Motivated by this analysis, we propose a novel perplexity-aware reinforcement learning algorithm to rectify the evaluation preference.
arXiv Detail & Related papers (2025-11-13T13:37:45Z)
- TuRTLe: A Unified Evaluation of LLMs for RTL Generation [0.6010802600885173]
We propose TuRTLe, a unified evaluation framework designed to assess LLMs across key RTL generation tasks. We benchmark a diverse set of open LLMs and analyze their strengths and weaknesses in EDA-specific tasks. Our results show that reasoning-based models, such as DeepSeek R1, consistently outperform others across multiple evaluation criteria.
arXiv Detail & Related papers (2025-03-31T07:43:12Z)
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs). Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
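A minimal sketch of this closed-loop idea follows, assuming an `llm` endpoint that revises a draft given a critique and an `is_correct` oracle for the task; both are hypothetical placeholders, not the benchmark's actual harness.

```python
from typing import Callable

LLM = Callable[[str], str]  # text-in/text-out model endpoint (assumption)

def closed_loop_critique_score(llm: LLM, problem: str, draft: str, critique: str,
                               is_correct: Callable[[str], bool]) -> float:
    """Score a critique by the correction it induces, not by its text alone."""
    revision = llm(f"Problem: {problem}\nDraft answer: {draft}\n"
                   f"Critique: {critique}\n"
                   "Revise the draft to address the critique.")
    # Credit critiques that turn a wrong draft into a correct revision.
    return 1.0 if is_correct(revision) and not is_correct(draft) else 0.0
```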
arXiv Detail & Related papers (2025-01-24T13:48:10Z)
- Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- RaCT: Ranking-aware Chain-of-Thought Optimization for LLMs [30.216174551427443]
Large language models (LLMs) have demonstrated remarkable potential in text reranking tasks. However, conventional supervised fine-tuning approaches for specializing LLMs in ranking tasks often lead to significant degradation of the models' general-purpose abilities. This paper presents a novel methodology that strategically combines Chain-of-Thought (CoT) prompting techniques with an innovative two-stage training pipeline.
arXiv Detail & Related papers (2024-12-18T23:24:15Z)
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by utilizing multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of relying on a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
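A rough sketch of the aggregation step under this multi-agent framing; the prompts and the single judge-model merge are illustrative assumptions, not the MultiCritique pipeline itself.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # text-in/text-out model endpoint (assumption)

def aggregate_critiques(agents: List[LLM], judge: LLM,
                        question: str, answer: str) -> str:
    """Collect critiques from several agent models and merge them with a judge."""
    critiques = [agent(f"Critique this answer.\nQ: {question}\nA: {answer}")
                 for agent in agents]
    return judge("Merge the following critiques into a single high-quality critique, "
                 "keeping points raised by multiple agents and dropping contradictions:\n\n"
                 + "\n---\n".join(critiques))
```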
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
- LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
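The decompose-critique-refine loop might look roughly like this; the prompts, the OK-token protocol, and the round limit are illustrative assumptions rather than DeCRIM's published implementation.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # text-in/text-out model endpoint (assumption)

def decompose_critique_refine(llm: LLM, critic: LLM, instruction: str,
                              max_rounds: int = 3) -> str:
    """Iteratively repair a response until the critic accepts every constraint."""
    constraints = [c.strip() for c in
                   llm("List each constraint in this instruction, one per line:\n"
                       + instruction).splitlines() if c.strip()]
    response = llm(instruction)
    for _ in range(max_rounds):
        violations: List[str] = []
        for constraint in constraints:
            verdict = critic(f"Response: {response}\nConstraint: {constraint}\n"
                             "Reply OK if satisfied, otherwise describe the violation.")
            if not verdict.strip().upper().startswith("OK"):
                violations.append(f"- {constraint}: {verdict}")
        if not violations:  # every constraint passed; stop refining
            break
        response = llm(f"Instruction: {instruction}\nPrevious response: {response}\n"
                       "Fix these violations:\n" + "\n".join(violations))
    return response
```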
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
- TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation [24.954629877691623]
TICK (Targeted Instruct-evaluation with ChecKlists) is a fully automated, interpretable evaluation protocol.
We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists.
We then show that STICK (Self-TICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection.
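As a sketch of the Best-of-N use, one can score each sampled candidate against the generated checklist and keep the best; the YES/NO check prompt below is an assumption, not TICK's exact protocol.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # text-in/text-out model endpoint (assumption)

def best_of_n_with_checklist(llm: LLM, instruction: str,
                             checklist: List[str], n: int = 4) -> str:
    """Sample n candidates and return the one passing the most checklist items."""
    def passed_items(response: str) -> int:
        return sum(llm(f"Response: {response}\nCheck: {item}\nAnswer YES or NO.")
                   .strip().upper().startswith("YES")
                   for item in checklist)
    candidates = [llm(instruction) for _ in range(n)]
    return max(candidates, key=passed_items)  # ties resolved by sampling order
```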
arXiv Detail & Related papers (2024-10-04T17:09:08Z)
- Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability. It leverages a CoT reasoning paradigm and the automatic construction of distant-supervision data, without human annotation. Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.