Related papers: Comparing Plausibility Estimates in Base and Instruction-Tuned Large Language Models

Comparing Plausibility Estimates in Base and Instruction-Tuned Large Language Models

URL: http://arxiv.org/abs/2403.14859v1
Date: Thu, 21 Mar 2024 22:08:44 GMT
Title: Comparing Plausibility Estimates in Base and Instruction-Tuned Large Language Models
Authors: Carina Kauf, Emmanuele Chersoni, Alessandro Lenci, Evelina Fedorenko, Anna A. Ivanova,
Abstract summary: We compare base and instruction-tuned LLM performance on an English sentence plausibility task via explicit prompting and implicit estimation. Experiment 1 shows that, across model architectures and plausibility datasets, log likelihood ($textitLL$) scores are the most reliable indicator of sentence plausibility. Experiment 2 shows that $textitLL$ scores across models are modulated by context in the expected way, showing high performance on three metrics of context-sensitive plausibility.
Score: 50.15455336684986
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction-tuned LLMs can respond to explicit queries formulated as prompts, which greatly facilitates interaction with human users. However, prompt-based approaches might not always be able to tap into the wealth of implicit knowledge acquired by LLMs during pre-training. This paper presents a comprehensive study of ways to evaluate semantic plausibility in LLMs. We compare base and instruction-tuned LLM performance on an English sentence plausibility task via (a) explicit prompting and (b) implicit estimation via direct readout of the probabilities models assign to strings. Experiment 1 shows that, across model architectures and plausibility datasets, (i) log likelihood ($\textit{LL}$) scores are the most reliable indicator of sentence plausibility, with zero-shot prompting yielding inconsistent and typically poor results; (ii) $\textit{LL}$-based performance is still inferior to human performance; (iii) instruction-tuned models have worse $\textit{LL}$-based performance than base models. In Experiment 2, we show that $\textit{LL}$ scores across models are modulated by context in the expected way, showing high performance on three metrics of context-sensitive plausibility and providing a direct match to explicit human plausibility judgments. Overall, $\textit{LL}$ estimates remain a more reliable measure of plausibility in LLMs than direct prompting.

Related papers

CAAD: Context-Aware Adaptive Decoding for Truthful Text Generation [31.469511576774252]
We propose a context-aware adaptive decoding method for large language models.<n>Our approach achieves a 2.8 percent average improvement on TruthfulQA.<n>Our model-agnostic, scalable, and efficient method requires only a single generation pass.
arXiv Detail & Related papers (2025-08-04T08:28:25Z)
What Are the Odds? Language Models Are Capable of Probabilistic Reasoning [23.487484744911995]
We focus on evaluating the probabilistic reasoning capabilities of language models (LMs) using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities.
arXiv Detail & Related papers (2024-06-18T17:51:24Z)
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society. In this work, we reformulating such tasks as modular neurosymbolic programming, which we call LINC. We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z)
QA-NatVer: Question Answering for Natural Logic-based Fact Verification [11.002475880349452]
We propose to use question answering to predict natural logic operators. In a few-shot setting on FEVER, our approach outperforms the best baseline by $4.3$ accuracy points. A human evaluation indicates that our approach produces more plausible with fewer erroneous natural logic operators than previous natural logic-based systems.
arXiv Detail & Related papers (2023-10-22T06:27:31Z)
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space. However, the predictions are often unreliable due to the Aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arisen from the inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z)
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses. LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
Evaluating statistical language models as pragmatic reasoners [39.72348730045737]
We evaluate the capacity of large language models to infer meanings of pragmatic utterances. We find that LLMs can derive context-grounded, human-like distributions over the interpretations of several complex pragmatic utterances. Results inform the inferential capacity of statistical language models, and their use in pragmatic and semantic parsing applications.
arXiv Detail & Related papers (2023-05-01T18:22:10Z)
MURMUR: Modular Multi-Step Reasoning for Semi-Structured Data-to-Text Generation [102.20036684996248]
We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning. We conduct experiments on two data-to-text generation tasks like WebNLG and LogicNLG.
arXiv Detail & Related papers (2022-12-16T17:36:23Z)
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals. It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation. It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
When Stability meets Sufficiency: Informative Explanations that do not Overwhelm [15.897648942908747]
We consider features-based attribution methods that highlight what should be minimally sufficient to justify the classification of an input. While minimal sufficiency is an attractive property akin to comprehensibility, the resulting explanations are often too sparse for a human to understand and evaluate the local behavior of the model. We propose a novel method called Path-Sufficient Explanations Method (PSEM) that outputs a sequence of stable and sufficient explanations for a given input.
arXiv Detail & Related papers (2021-09-13T16:06:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.