NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals
- URL: http://arxiv.org/abs/2502.08080v2
- Date: Fri, 07 Mar 2025 15:17:43 GMT
- Title: NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals
- Authors: Neha Srikanth, Rachel Rudinger
- Abstract summary: We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems.
- Score: 19.300202585383914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decomposition of text into atomic propositions is a flexible framework allowing for the closer inspection of input and output text. We use atomic decomposition of hypotheses in two natural language reasoning tasks, traditional NLI and defeasible NLI, to form atomic sub-problems, or granular inferences that models must weigh when solving the overall problem. These atomic sub-problems serve as a tool to further understand the structure of both NLI and defeasible reasoning, probe a model's consistency and understanding of different inferences, and measure the diversity of examples in benchmark datasets. Our results indicate that LLMs still struggle with logical consistency on atomic NLI and defeasible NLI sub-problems. Lastly, we identify critical atomic sub-problems of defeasible NLI examples, or those that most contribute to the overall label, and propose a method to measure the inferential consistency of a model, a metric designed to capture the degree to which a model makes consistently correct or incorrect predictions about the same fact under different contexts.
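As a concrete illustration of the proposed inferential consistency metric, the Python sketch below gives one plausible reading of the idea rather than the authors' implementation: it assumes per-atom correctness judgments have already been collected for the same atomic hypothesis under several different contexts, and it scores the fraction of atoms on which the model is uniformly right or uniformly wrong.

```python
def inferential_consistency(predictions):
    """Toy consistency score over atomic sub-problems.

    `predictions` maps an atomic hypothesis (a single fact) to a list of
    booleans, one per context in which it was evaluated, indicating whether
    the model's prediction for that atom was correct in that context.
    An atom counts as consistent if the model is uniformly right or
    uniformly wrong about it across contexts; the score is the fraction of
    such atoms. (This is one plausible reading, not the paper's exact definition.)
    """
    if not predictions:
        return 0.0
    consistent = sum(1 for outcomes in predictions.values() if len(set(outcomes)) <= 1)
    return consistent / len(predictions)

# Example: three atomic facts, each judged under three different contexts.
preds = {
    "The man is outdoors.": [True, True, False],   # inconsistent across contexts
    "A dog is present.":    [True, True, True],    # consistently correct
    "It is raining.":       [False, False, False], # consistently incorrect
}
print(inferential_consistency(preds))  # -> 0.666...
```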
Related papers
- Statistical Hypothesis Testing for Auditing Robustness in Language Models [49.1574468325115]
We introduce distribution-based perturbation analysis, a framework that reformulates perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. We show how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models.
arXiv Detail & Related papers (2025-06-09T17:11:07Z) - Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models? [68.72260770171212]
We propose a paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps.
Our method not only generates cognitive CoT structures for various complex tasks but also mitigates the phenomenon of overthinking.
We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs.
arXiv Detail & Related papers (2025-03-08T15:23:47Z) - FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models [59.171510592986735]
We propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response.
Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches.
arXiv Detail & Related papers (2025-02-25T19:01:48Z) - On Reference (In-)Determinacy in Natural Language Inference [62.904689974282334]
We revisit the reference determinacy (RD) assumption in the task of natural language inference (NLI). We observe that current NLI models fail in downstream applications such as fact verification, where the input premise and hypothesis may refer to different contexts. We introduce RefNLI, a diagnostic benchmark for identifying reference ambiguity in NLI examples.
arXiv Detail & Related papers (2025-02-09T06:58:13Z) - JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models. JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures. Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z) - AtomThink: Multimodal Slow Thinking with Atomic Step Reasoning [68.65389926175506]
We propose a novel paradigm of Self-structured Chain of Thought (SCoT). Our method not only generates cognitive CoT structures for various complex tasks but also mitigates the phenomenon of overthinking for easier tasks. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs.
arXiv Detail & Related papers (2024-11-18T11:54:58Z) - NLP Verification: Towards a General Methodology for Certifying Robustness [9.897538432223714]
Machine Learning (ML) has exhibited substantial success in the field of Natural Language Processing (NLP). As these systems are increasingly integrated into real-world applications, ensuring their safety and reliability becomes a primary concern. We propose a general methodology to analyse the effect of the embedding gap, a problem that refers to the discrepancy between verification of geometric subspaces and the semantic meaning of sentences.
arXiv Detail & Related papers (2024-03-15T09:43:52Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Atomic Inference for NLI with Generated Facts as Atoms [26.320297488995262]
Atomic inference provides interpretable and faithful model decisions.
This approach involves making predictions for different components (or atoms) of an instance, before using interpretable and deterministic rules to derive the overall prediction.
We investigate the effectiveness of using LLM-generated facts as atoms, decomposing Natural Language Inference premises into lists of facts; a minimal sketch of this atom-level prediction-and-aggregation pattern appears after this list.
arXiv Detail & Related papers (2023-05-22T16:45:50Z) - Benchmarking Faithfulness: Towards Accurate Natural Language Explanations in Vision-Language Tasks [0.0]
Natural language explanations (NLEs) promise to enable the communication of a model's decision-making in an easily intelligible way.
While current models successfully generate convincing explanations, it is an open question how well the NLEs actually represent the reasoning process of the models.
We propose three faithfulness metrics: Attribution-Similarity, NLE-Sufficiency, and NLE-Comprehensiveness.
arXiv Detail & Related papers (2023-04-03T08:24:10Z) - Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z) - A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z) - Decomposing Natural Logic Inferences in Neural NLI [9.606462437067984]
We investigate whether neural NLI models capture the crucial semantic features central to natural logic: monotonicity and concept inclusion.
We find that monotonicity information is notably weak in the representations of popular NLI models which achieve high scores on benchmarks.
arXiv Detail & Related papers (2021-12-15T17:35:30Z) - Disentangling Observed Causal Effects from Latent Confounders using Method of Moments [67.27068846108047]
We provide guarantees on identifiability and learnability under mild assumptions.
We develop efficient algorithms based on coupled tensor decomposition with linear constraints to obtain scalable and guaranteed solutions.
arXiv Detail & Related papers (2021-01-17T07:48:45Z) - Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation [14.431925736607043]
We present Monotonicity NLI (MoNLI), a new naturalistic dataset focused on lexical entailment and negation.
In behavioral evaluations, we find that models trained on general-purpose NLI datasets fail systematically on MoNLI examples containing negation.
In structural evaluations, we look for evidence that our top-performing BERT-based model has learned to implement the monotonicity algorithm behind MoNLI.
arXiv Detail & Related papers (2020-04-30T07:53:20Z)
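Picking up the forward reference in the "Atomic Inference for NLI with Generated Facts as Atoms" entry above: the Python sketch below illustrates the general atom-level prediction-and-aggregation pattern that entry shares with the main paper. `decompose_premise` and `atom_entails` are naive stand-ins for an LLM-based fact generator and an NLI classifier, and the aggregation rule is one plausible deterministic choice, not the exact rule from either paper.

```python
import re
from typing import List, Literal

Label = Literal["entailment", "contradiction", "neutral"]

def decompose_premise(premise: str) -> List[str]:
    # Naive stand-in for an LLM-based fact generator: split on sentence
    # boundaries and "and". A real system would generate atomic facts.
    parts = re.split(r"\.\s+|\band\b", premise)
    return [p.strip().rstrip(".") for p in parts if p.strip()]

def atom_entails(fact: str, hypothesis: str) -> Label:
    # Naive stand-in for an NLI classifier on a single (fact, hypothesis) pair:
    # call it entailment if every word of the hypothesis appears in the fact.
    fact_words = set(fact.lower().split())
    hyp_words = set(hypothesis.lower().rstrip(".").split())
    return "entailment" if hyp_words <= fact_words else "neutral"

def atomic_nli(premise: str, hypothesis: str) -> Label:
    # One plausible deterministic aggregation rule over atom-level predictions:
    # any supporting fact yields entailment, any contradicting fact yields
    # contradiction, otherwise neutral.
    labels = [atom_entails(f, hypothesis) for f in decompose_premise(premise)]
    if "entailment" in labels:
        return "entailment"
    if "contradiction" in labels:
        return "contradiction"
    return "neutral"

print(atomic_nli("A man is cooking and a dog sleeps nearby.", "a dog sleeps nearby"))
# -> entailment (the second atom alone supports the hypothesis)
```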