Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment
- URL: http://arxiv.org/abs/2410.09083v2
- Date: Tue, 20 May 2025 15:29:37 GMT
- Title: Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment
- Authors: Lu Chen, Yuxuan Huang, Yixing Li, Dongrui Liu, Qihan Ren, Shuai Zhao, Kun Kuang, Zilong Zheng, Quanshi Zhang,
- Abstract summary: We evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs.<n>Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.
- Score: 53.17596274334017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a method to analyze the inference patterns used by Large Language Models (LLMs) for judgment in a case study on legal LLMs, so as to identify potential incorrect representations of the LLM, according to human domain knowledge. Unlike traditional evaluations on language generation results, we propose to evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs. To this end, we quantify the interactions between input phrases used by the LLM as primitive inference patterns, because recent theoretical achievements have proven several mathematical guarantees of the faithfulness of the interaction-based explanation. We design a set of metrics to evaluate the detailed inference patterns of LLMs. Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.
Related papers
- Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations [7.81820080453498]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.<n>We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic.
arXiv Detail & Related papers (2025-07-13T19:05:43Z) - MoRE-LLM: Mixture of Rule Experts Guided by a Large Language Model [54.14155564592936]
We propose a Mixture of Rule Experts guided by a Large Language Model (MoRE-LLM)<n>MoRE-LLM steers the discovery of local rule-based surrogates during training and their utilization for the classification task.<n>LLM is responsible for enhancing the domain knowledge alignment of the rules by correcting and contextualizing them.
arXiv Detail & Related papers (2025-03-26T11:09:21Z) - Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement [22.992484902761994]
This study systematically evaluates the performance of multiple Large Language Models (LLMs) in detecting offensive language.<n>We analyze binary classification accuracy, examine the relationship between model confidence and human disagreement, and explore how disagreement samples influence model decision-making.
arXiv Detail & Related papers (2025-02-10T07:14:26Z) - Aligning with Logic: Measuring, Evaluating and Improving Logical Consistency in Large Language Models [31.558429029429863]
We study logical consistency of Large Language Models (LLMs) as a prerequisite for more reliable and trustworthy systems.
We first propose a universal framework to quantify the logical consistency via three fundamental proxies: transitivity, commutativity and negation invariance.
We then evaluate logical consistency, using the defined measures, of a wide range of LLMs, demonstrating that it can serve as a strong proxy for overall robustness.
arXiv Detail & Related papers (2024-10-03T04:34:04Z) - Proof of Thought : Neurosymbolic Program Synthesis allows Robust and Interpretable Reasoning [1.3003982724617653]
Large Language Models (LLMs) have revolutionized natural language processing, yet they struggle with inconsistent reasoning.
This research introduces Proof of Thought, a framework that enhances the reliability and transparency of LLM outputs.
Key contributions include a robust type system with sort management for enhanced logical integrity, explicit representation of rules for clear distinction between factual and inferential knowledge.
arXiv Detail & Related papers (2024-09-25T18:35:45Z) - Logic-Enhanced Language Model Agents for Trustworthy Social Simulations [3.5083201638203154]
This study focuses on decision-making in game-theoretic scenarios as a model of human interaction.
We introduce the Logic-Enhanced Language Model Agents (LELMA) framework, a novel approach to enhance the trustworthiness of social simulations.
arXiv Detail & Related papers (2024-08-28T18:25:35Z) - Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models [0.0]
Large Language Models (LLMs) have shown exceptional performance in text processing.
This paper proposes a novel approach to training LLMs using knowledge transfer from a random forest (RF) ensemble.
We generate outputs for fine-tuning, enhancing the model's ability to classify and explain its decisions.
arXiv Detail & Related papers (2024-06-07T13:31:51Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Quantifying In-Context Reasoning Effects and Memorization Effects in LLMs [101.51435599249234]
We propose an axiomatic system to define and quantify the precise memorization and in-context reasoning effects used by the large language model (LLM)
Specifically, the axiomatic system enables us to categorize the memorization effects into foundational memorization effects and chaotic memorization effects.
Experiments show that the clear disentanglement of memorization effects and in-context reasoning effects enables a straightforward examination of detailed inference patterns encoded by LLMs.
arXiv Detail & Related papers (2024-05-20T08:51:03Z) - Argumentative Large Language Models for Explainable and Contestable Decision-Making [13.045050015831903]
Large language models (LLMs) are a promising candidate for use in decision-making.
They are limited by their inability to reliably provide outputs which are explainable and contestable.
We introduce argumentative LLMs, a method utilising LLMs to construct argumentation frameworks.
We demonstrate the effectiveness of argumentative LLMs experimentally in the decision-making task of claim verification.
arXiv Detail & Related papers (2024-05-03T13:12:28Z) - Characterizing Truthfulness in Large Language Model Generations with
Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs)
We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks.
Recent efforts have demonstrated that the LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z) - LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs)
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Enhancing Logical Reasoning in Large Language Models to Facilitate Legal
Applications [4.062485135201161]
Large Language Models (LLMs) attempt to emulate human language understanding and generation, but their competency in logical reasoning remains limited.
This paper seeks to address the philosophical question: How can we effectively teach logical reasoning to LLMs?
By focusing on bolstering LLMs' capabilities in logical reasoning, we aim to expand their applicability in law and other logic-intensive disciplines.
arXiv Detail & Related papers (2023-11-22T01:51:50Z) - Are Large Language Models Reliable Judges? A Study on the Factuality
Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z) - Improving Large Language Models in Event Relation Logical Prediction [33.88499005859982]
Event relation extraction is a challenging task that demands thorough semantic understanding and rigorous logical reasoning.
In this paper, we conduct an in-depth investigation to systematically explore the capability of LLMs in understanding and applying event relation logic.
Our study reveals that LLMs are not logically consistent reasoners, which results in their suboptimal performance on tasks that need rigorous reasoning.
arXiv Detail & Related papers (2023-10-13T14:53:06Z) - Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning technique to assess the performance of model.
arXiv Detail & Related papers (2023-10-02T01:00:50Z) - Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z) - ThinkSum: Probabilistic reasoning over sets using large language models [18.123895485602244]
We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner.
We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks.
arXiv Detail & Related papers (2022-10-04T00:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.