Related papers: Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models

URL: http://arxiv.org/abs/2407.08440v4
Date: Thu, 17 Oct 2024 07:00:19 GMT
Title: Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models
Authors: Wangtao Sun, Chenxiang Zhang, XueYou Zhang, Xuanqing Yu, Ziyang Huang, Pei Chen, Haotian Xu, Shizhu He, Jun Zhao, Kang Liu
Abstract summary: Large Language Models (LLMs) are supposed to be controlled and guided by rules in real-world scenarios to be safe, accurate, and intelligent. Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios. This paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities.
Score: 25.337295202341608
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although Large Language Models (LLMs) have demonstrated strong ability, they are further supposed to be controlled and guided by rules in real-world scenarios to be safe, accurate, and intelligent. This demands that LLMs possess inferential rule-following capability. However, no prior work has made a clear evaluation of the inferential rule-following capability of LLMs. Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into how LLMs can be improved toward becoming better inferential rule-following intelligent agents. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://anonymous.4open.science/r/llm-rule-following-B3E3/

Related papers

Learned-Rule-Augmented Large Language Model Evaluators [5.4343364964031124]
Large language models (LLMs) are predominantly used as evaluators for natural language generation (NLG) tasks. This work explores the potential of LLMs as general evaluators across diverse tasks.
arXiv Detail & Related papers (2025-12-01T18:08:45Z)
RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models [13.343944091570386]
Large Language Models (LLMs) can propose rules in natural language, sidestepping the need for a predefined predicate space in traditional rule learning. We present RLIE, a unified framework that integrates LLMs with probabilistic modeling to learn a set of weighted rules. Applying rules directly with their learned weights yields superior performance, whereas prompting LLMs with the rules, weights, and logistic-model outputs surprisingly degrades accuracy.
arXiv Detail & Related papers (2025-10-22T15:50:04Z)
Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences? [5.542420010310746]
A critical, yet understudied, issue is the potential divergence between an LLM's stated preferences and its revealed preferences. This work formally defines and proposes a method to measure this preference deviation. Our study will be crucial for integrating LLMs into services, especially those that interact directly with humans.
arXiv Detail & Related papers (2025-05-31T23:38:48Z)
Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP [1.5567685129899713]
Large language models (LLMs) can amplify misinformation, undermining societal goals like the UN. We study three documented drivers of misinformation (valence framing, information overload) which are often shaped by one's default beliefs. Building on evidence that LLMs encode defaults, we ask: can general belief-drivens behind misinformative behaviour be recovered from LLMs as clear rules?
arXiv Detail & Related papers (2025-05-16T12:48:44Z)
Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation [44.58099275559231]
Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges.
arXiv Detail & Related papers (2025-03-24T19:24:40Z)
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [50.16340812031201]
We show that large language models (LLMs) do not update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of an optimal Bayesian model.
arXiv Detail & Related papers (2025-03-21T20:13:04Z)
InductionBench: LLMs Fail in the Simplest Complexity Class [53.70978746199222]
Large language models (LLMs) have shown remarkable improvements in reasoning. Inductive reasoning, where one infers the underlying rules from observed data, remains less explored. We introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs.
arXiv Detail & Related papers (2025-02-20T03:48:00Z)
Training Large Language Models to be Better Rule Followers [23.958458849973248]
Large language models (LLMs) have shown impressive performance across a wide range of tasks. Current training methods fail to leverage these rules effectively. We propose Meta Rule-Following Fine-Tuning (Meta-RFFT) to enhance the cross-task transferability of rule-following abilities.
arXiv Detail & Related papers (2025-02-17T07:54:50Z)
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
arXiv Detail & Related papers (2024-12-12T06:08:46Z)
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents [55.64361927346957]
We propose a neurosymbolic approach to learn rules gradient-free through large language models (LLMs). Our embodied LLM agent "WALL-E" is built upon model-predictive control (MPC). On open-world challenges in Minecraft and ALFWorld, WALL-E achieves higher success rates than existing methods.
arXiv Detail & Related papers (2024-10-09T23:37:36Z)
Few-Shot Fairness: Unveiling LLM's Potential for Fairness-Aware Classification [7.696798306913988]
We introduce a framework outlining fairness regulations aligned with various fairness definitions. We explore the configuration for in-context learning and the procedure for selecting in-context demonstrations using RAG. Experiments conducted with different LLMs indicate that GPT-4 delivers superior results in terms of both accuracy and fairness compared to other models.
arXiv Detail & Related papers (2024-02-28T17:29:27Z)
Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. We propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z)
PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs. We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z)
Enabling Large Language Models to Learn from Rules [99.16680531261987]
We are inspired by the fact that humans can also learn new tasks or knowledge in another way: by learning from rules. We propose rule distillation, which first uses the strong in-context abilities of LLMs to extract the knowledge from the textual rules. Our experiments show that making LLMs learn from rules by our method is much more efficient than example-based learning in both the sample size and generalization ability.
arXiv Detail & Related papers (2023-11-15T11:42:41Z)
Can LLMs Follow Simple Rules? [28.73820874333199]
Rule-following Language Evaluation Scenarios (RuLES) is a framework for measuring rule-following ability in Large Language Models. RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user. We show that almost all current models struggle to follow scenario rules, even on straightforward test cases.
arXiv Detail & Related papers (2023-11-06T08:50:29Z)
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for Large Language Models. We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated. We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z)
RuleBert: Teaching Soft Rules to Pre-trained Language Models [21.69870624809201]
We introduce a classification task where, given facts and soft rules, the PLM should return a prediction with a probability for a given hypothesis. We propose a revised loss function that enables the PLM to learn how to predict precise probabilities for the task. Our evaluation results show that the resulting fine-tuned models achieve very high performance, even on logical rules that were unseen at training.
arXiv Detail & Related papers (2021-09-24T16:19:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.