Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models
- URL: http://arxiv.org/abs/2407.08440v4
- Date: Thu, 17 Oct 2024 07:00:19 GMT
- Title: Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models
- Authors: Wangtao Sun, Chenxiang Zhang, XueYou Zhang, Xuanqing Yu, Ziyang Huang, Pei Chen, Haotian Xu, Shizhu He, Jun Zhao, Kang Liu,
- Abstract summary: Large Language Models (LLMs) are supposed to be controlled and guided by in real-world scenarios to be safe, accurate, and intelligent.
Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios.
This paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities.
- Score: 25.337295202341608
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Large Language Models (LLMs) have demonstrated strong ability, they are further supposed to be controlled and guided by in real-world scenarios to be safe, accurate, and intelligent. This demands the possession of capability of LLMs. However, no prior work has made a clear evaluation of the inferential rule-following capability of LLMs. Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into the improvements for LLMs toward a better inferential rule-following intelligent agent. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://anonymous.4open.science/r/llm-rule-following-B3E3/
Related papers
- WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents [55.64361927346957]
We propose a neurosymbolic approach to learn rules gradient-free through large language models (LLMs)
Our embodied LLM agent "WALL-E" is built upon model-predictive control (MPC)
On open-world challenges in Minecraft and ALFWorld, WALL-E achieves higher success rates than existing methods.
arXiv Detail & Related papers (2024-10-09T23:37:36Z) - Few-Shot Fairness: Unveiling LLM's Potential for Fairness-Aware
Classification [7.696798306913988]
We introduce a framework outlining fairness regulations aligned with various fairness definitions.
We explore the configuration for in-context learning and the procedure for selecting in-context demonstrations using RAG.
Experiments conducted with different LLMs indicate that GPT-4 delivers superior results in terms of both accuracy and fairness compared to other models.
arXiv Detail & Related papers (2024-02-28T17:29:27Z) - Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks.
However, their mastery of underlying inferential rules still falls short of human capabilities.
We propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z) - PRE: A Peer Review Based Large Language Model Evaluator [14.585292530642603]
Existing paradigms rely on either human annotators or model-based evaluators to evaluate the performance of LLMs.
We propose a novel framework that can automatically evaluate LLMs through a peer-review process.
arXiv Detail & Related papers (2024-01-28T12:33:14Z) - Enabling Large Language Models to Learn from Rules [99.16680531261987]
We are inspired that humans can learn the new tasks or knowledge in another way by learning from rules.
We propose rule distillation, which first uses the strong in-context abilities of LLMs to extract the knowledge from the textual rules.
Our experiments show that making LLMs learn from rules by our method is much more efficient than example-based learning in both the sample size and generalization ability.
arXiv Detail & Related papers (2023-11-15T11:42:41Z) - Can LLMs Follow Simple Rules? [28.73820874333199]
Rule-following Language Evaluation Scenarios (RuLES) is a framework for measuring rule-following ability in Large Language Models.
RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user.
We show that almost all current models struggle to follow scenario rules, even on straightforward test cases.
arXiv Detail & Related papers (2023-11-06T08:50:29Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a benchmark for Fine-grained Constraints Following Benchmark for Large Language Models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z) - RuleBert: Teaching Soft Rules to Pre-trained Language Models [21.69870624809201]
We introduce a classification task where, given facts and soft rules, the PLM should return a prediction with a probability for a given hypothesis.
We propose a revised loss function that enables the PLM to learn how to predict precise probabilities for the task.
Our evaluation results show that the resulting fine-tuned models achieve very high performance, even on logical rules that were unseen at training.
arXiv Detail & Related papers (2021-09-24T16:19:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.