How to Make the Most of LLMs' Grammatical Knowledge for Acceptability Judgments
- URL: http://arxiv.org/abs/2408.09639v2
- Date: Fri, 07 Feb 2025 07:02:26 GMT
- Title: How to Make the Most of LLMs' Grammatical Knowledge for Acceptability Judgments
- Authors: Yusuke Ide, Yuto Nishida, Justin Vasselli, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
- Abstract summary: The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs.
Recent large language models (LLMs) are trained to perform tasks via prompting, and thus, the raw probabilities they assign may not fully reflect their grammatical knowledge.
This study attempts to derive more accurate judgments from LLMs using prompts and templates.
- Score: 22.76776244036282
- Abstract: The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where the LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is more acceptable. Conventional approaches directly compare sentence probabilities assigned by LMs, but recent large language models (LLMs) are trained to perform tasks via prompting, and thus, the raw probabilities they assign may not fully reflect their grammatical knowledge. In this study, we attempt to derive more accurate acceptability judgments from LLMs using prompts and templates. Through extensive experiments in English and Chinese, we compare nine judgment methods and find that two of them, a probability readout method (in-template LP) and a prompt-based method (Yes/No probability computing), achieve higher accuracy than the conventional ones. Our analysis reveals that these methods excel in different linguistic phenomena, suggesting they access different aspects of LLMs' knowledge. We also find that ensembling the two methods outperforms single methods. Consequently, we recommend these techniques, either individually or ensembled, as more effective alternatives to conventional approaches for assessing grammatical knowledge in LLMs.
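The abstract describes the two recommended method families only at a high level. The following is a minimal, illustrative Python sketch (not the authors' released code or exact templates) of what a probability readout over a minimal pair and a Yes/No probability computation can look like with a generic HuggingFace causal LM; the model name, template wording, and prompt are assumptions made for illustration.

```python
# Minimal sketch of two acceptability-judgment strategies; model, prompt,
# and template wording are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability the LM assigns to `text` (higher = more probable)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token; undo the averaging.
    return -out.loss.item() * (ids.shape[1] - 1)

def judge_by_logprob(acceptable: str, unacceptable: str, template: str = "{}") -> bool:
    """Minimal-pair judgment: wrap each sentence in a template (an in-template
    readout when the template carries a task instruction) and pick the
    higher-probability sentence. The default template is the conventional
    raw-probability comparison."""
    return sentence_logprob(template.format(acceptable)) > sentence_logprob(
        template.format(unacceptable)
    )

def yes_prob(sentence: str) -> float:
    """Prompt-based judgment: relative probability mass the LM puts on ' Yes'
    versus ' No' as the next token after an acceptability question.
    The question wording here is an assumption."""
    prompt = f"Is the following sentence grammatically acceptable?\n{sentence}\nAnswer:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()

# Example minimal pair (BLiMP-style):
good, bad = "The cats sleep on the sofa.", "The cats sleeps on the sofa."
print(judge_by_logprob(good, bad))       # probability readout over the pair
print(yes_prob(good) > yes_prob(bad))    # Yes/No probability computing
```

An ensemble of the two methods, as recommended in the abstract, could combine the two signals (for example by averaging normalized scores before comparing the pair), though the specific ensembling scheme here is assumed rather than taken from the paper.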
Related papers
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs [21.97227334180969]
"LLM-as-a-judge" paradigm employs Large Language Models as annotators and evaluators in tasks traditionally performed by humans.
Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators.
We propose a novel statistical procedure -- the Alternative Annotator Test (alt-test) -- that requires only a modest subset of annotated examples to justify using LLM annotations.
arXiv Detail & Related papers (2025-01-19T07:09:11Z) - Learning-From-Mistakes Prompting for Indigenous Language Translation [3.7790255156708397]
This paper presents techniques to improve translation for extremely low-resource indigenous languages.
Our approaches are grounded in the use of a datastore consisting of a limited number of parallel translation examples.
We harness the potential of LLMs and in-context learning techniques in such a setting to use LLMs as universal translators.
arXiv Detail & Related papers (2024-07-18T09:41:20Z) - TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
arXiv Detail & Related papers (2024-06-12T17:21:21Z) - Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach [0.0]
Large Language Models (LLMs) produce inaccurate outputs, also known as hallucinations.
This paper introduces a supervised learning approach employing only four numerical features derived from tokens and vocabulary probabilities obtained from other evaluators.
The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks.
arXiv Detail & Related papers (2024-05-30T03:00:47Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs).
We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties.
The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z) - LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z) - Knowledge Rumination for Pre-trained Language Models [77.55888291165462]
We propose a new paradigm dubbed Knowledge Rumination to help the pre-trained language model utilize related latent knowledge without retrieving it from the external corpus.
We apply the proposed knowledge rumination to various language models, including RoBERTa, DeBERTa, and GPT-3.
arXiv Detail & Related papers (2023-05-15T15:47:09Z) - Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z) - Language Models as an Alternative Evaluator of Word Order Hypotheses: A Case Study in Japanese [45.80297329300326]
We examine a methodology that uses neural language models (LMs) to analyze word order.
We explore whether the LM-based method is valid for word order analysis.
We conclude that LMs display sufficient word order knowledge for usage as an analysis tool.
arXiv Detail & Related papers (2020-05-02T14:32:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.