Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
- URL: http://arxiv.org/abs/2410.15173v1
- Date: Sat, 19 Oct 2024 18:25:30 GMT
- Title: Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
- Authors: Safeyah Khaled Alshemali, Daniel Bauer, Yuval Marton
- Abstract summary: We assess whether pre-trained autoregressive LLMs possess consistent, expressible knowledge about thematic fit.
We evaluate both closed and open state-of-the-art LLMs on several psycholinguistic datasets.
Our results show that chain-of-thought reasoning is more effective on datasets with self-explanatory semantic role labels.
- Abstract: The thematic fit estimation task measures the compatibility between a predicate (typically a verb), an argument (typically a noun phrase), and a specific semantic role assigned to the argument. Previous state-of-the-art work has focused on modeling thematic fit through distributional or neural models of event representation, trained in a supervised fashion with indirect labels. In this work, we assess whether pre-trained autoregressive LLMs possess consistent, expressible knowledge about thematic fit. We evaluate both closed and open state-of-the-art LLMs on several psycholinguistic datasets, along three axes: (1) Reasoning Form: multi-step logical reasoning (chain-of-thought prompting) vs. simple prompting. (2) Input Form: providing context (generated sentences) vs. raw tuples <predicate, argument, role>. (3) Output Form: categorical vs. numeric. Our results show that chain-of-thought reasoning is more effective on datasets with self-explanatory semantic role labels, especially Location. Generated sentences helped only in a few settings and lowered results in many others. Predefined categorical (compared to numeric) output raised GPT's results across the board, with a few exceptions, but lowered Llama's. We also saw that semantically incoherent generated sentences, which the models cannot consistently filter out, hurt reasoning and overall performance. Our GPT-powered methods set a new state-of-the-art on all tested datasets.
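As a rough illustration of how the three evaluation axes above could be combined into prompt variants, here is a minimal Python sketch. The prompt wording, category labels, and 1-7 numeric scale are assumptions made for illustration, not the prompts actually used in the paper.

```python
# Illustrative only: a hypothetical sketch of the three prompting axes
# (Reasoning Form, Input Form, Output Form) described in the abstract.
# The prompt wording, category labels, and 1-7 scale are assumptions,
# not the paper's actual prompts.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ThematicFitItem:
    predicate: str                  # typically a verb, e.g. "cut"
    argument: str                   # typically a noun phrase, e.g. "knife"
    role: str                       # semantic role, e.g. "Instrument"
    sentence: Optional[str] = None  # optional generated context sentence


def build_prompt(item: ThematicFitItem,
                 chain_of_thought: bool,    # Reasoning Form: CoT vs. simple prompting
                 use_sentence: bool,        # Input Form: generated sentence vs. raw tuple
                 categorical: bool) -> str:  # Output Form: categorical vs. numeric
    """Assemble one prompt variant for a <predicate, argument, role> item."""
    if use_sentence and item.sentence:
        context = f'Sentence: "{item.sentence}"'
    else:
        context = f"Tuple: <{item.predicate}, {item.argument}, {item.role}>"

    if categorical:
        answer_spec = ("Answer with exactly one label: "
                       "implausible, somewhat plausible, or very plausible.")
    else:
        answer_spec = "Answer with a single number from 1 (implausible) to 7 (very plausible)."

    reasoning = ("Think step by step about whether the argument fits the role, "
                 "then give your final answer.\n") if chain_of_thought else ""

    return (f"How well does '{item.argument}' fit the role '{item.role}' "
            f"of the predicate '{item.predicate}'?\n"
            f"{context}\n{reasoning}{answer_spec}")


if __name__ == "__main__":
    item = ThematicFitItem("cut", "knife", "Instrument",
                           sentence="The chef cut the bread with a knife.")
    # One of the eight (2 x 2 x 2) prompt variants per item:
    print(build_prompt(item, chain_of_thought=True, use_sentence=False, categorical=True))
```

Each <predicate, argument, role> item then yields eight prompt variants, one per combination of the three binary axes.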
Related papers
- Inference and Verbalization Functions During In-Context Learning [7.544880309193842]
Large language models (LMs) are capable of in-context learning (ICL) from a few demonstrations to solve new tasks during inference.
Previous work has observed that, in some settings, ICL performance is minimally affected by irrelevant labels.
We hypothesize that LMs perform ICL with irrelevant labels via two sequential processes: an inference function that solves the task, followed by a verbalization function that maps the inferred answer to the label space.
arXiv Detail & Related papers (2024-10-12T03:31:37Z) - Topic Modeling with Fine-tuning LLMs and Bag of Sentences [1.8592384822257952]
FT-Topic is an unsupervised fine-tuning approach for topic modeling.
SenClu is a state-of-the-art topic modeling method that achieves fast inference and hard assignments of sentence groups to a single topic.
arXiv Detail & Related papers (2024-08-06T11:04:07Z) - Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism [62.571419297164645]
This paper provides a systematic overview of prior works on the logical reasoning ability of large language models for analyzing categorical syllogisms.
We first investigate all the possible variations for the categorical syllogisms from a purely logical perspective.
We then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets.
arXiv Detail & Related papers (2024-06-26T21:17:20Z) - How Abstract Is Linguistic Generalization in Large Language Models? Experiments with Argument Structure [2.530495315660486]
We investigate the degree to which pre-trained Transformer-based large language models represent relationships between contexts.
We find that LLMs perform well in generalizing the distribution of a novel noun argument between related contexts.
However, LLMs fail at generalizations between related contexts that have not been observed during pre-training.
arXiv Detail & Related papers (2023-11-08T18:58:43Z) - LINC: A Neurosymbolic Approach for Logical Reasoning by Combining
Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z) - "I'd Like to Have an Argument, Please": Argumentative Reasoning in Large Language Models [0.0]
We evaluate the ability of two large language models (LLMs) to perform argumentative reasoning.
We find that, scoring-wise, the LLMs match or surpass the SOTA in argument mining (AM) and argument pair extraction (APE).
However, statistical analysis of the LLMs' outputs, when subjected to small yet still human-readable alterations in the I/O representations, showed that the models are not performing reasoning.
arXiv Detail & Related papers (2023-09-29T02:41:38Z) - Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z) - Not wacky vs. definitely wacky: A study of scalar adverbs in pretrained language models [0.0]
Modern pretrained language models, such as BERT, RoBERTa and GPT-3, hold the promise of performing better on logical tasks than classic static word embeddings.
We investigate the extent to which BERT, RoBERTa, GPT-2 and GPT-3 exhibit general, human-like knowledge of scalar adverbs.
We find that despite capturing some aspects of logical meaning, the models fall far short of human performance.
arXiv Detail & Related papers (2023-05-25T18:56:26Z) - APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning [73.3035118224719]
We propose APOLLO, an adaptively pretrained language model that has improved logical reasoning abilities.
APOLLO performs comparably on ReClor and outperforms baselines on LogiQA.
arXiv Detail & Related papers (2022-12-19T07:40:02Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - L2R2: Leveraging Ranking for Abductive Reasoning [65.40375542988416]
The abductive natural language inference task (αNLI) is proposed to evaluate the abductive reasoning ability of a learning system.
A novel L2R2 approach is proposed under the learning-to-rank framework.
Experiments on the ART dataset reach the state-of-the-art in the public leaderboard.
arXiv Detail & Related papers (2020-05-22T15:01:23Z)