Related papers: Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments

URL: http://arxiv.org/abs/2510.06001v1
Date: Tue, 07 Oct 2025 15:03:09 GMT
Title: Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments
Authors: Timothy Pistotti, Jason Brown, Michael Witbrock,
Abstract summary: This paper argues that the direct minimal pair approach offers greater diagnostic transparency.<n>We show GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles.
Score: 9.161468569386708
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies probing the Argument from the Poverty of the Stimulus (APS) have applied Large Language Models (LLMs) to test the learnability of complex syntax through surprisal-based metrics. However, divergent conclusions raise questions concerning the insights these metrics offer. While Wilcox et al. (2024) used direct minimal pair comparisons (the "wh-effect") to demonstrate that models successfully generalise knowledge of filler-gap dependencies, Lan et al. (2024) used a Difference-in-Differences (DiD) metric and found that models largely fail on parasitic gaps (PGs). This paper argues that the direct minimal pair approach offers greater diagnostic transparency. We demonstrate this by generating a full 8-permutation paradigm of refined PG stimuli and evaluating the GPT-2 model used in previous studies with a systematic Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across all four tested conditions, indicating robust knowledge of filler-gap licensing principles even in complex PG environments. This finding, which contrasts with the more ambiguous results from DiD-style metrics, suggests that the choice of evaluation metric is critical for assessing an LLM's syntactic competence.

Related papers

Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance [9.161468569386708]
This paper investigates the hypothesis that stimuli characteristics, including lexical ambiguities and structural complexities, may confound model performance.<n>A methodology is proposed for re-evaluating LLM competence on syntactic prediction, focusing on GPT-2.<n>Preliminary findings indicate that GPT-2 demonstrates notably improved performance on these refined PG stimuli compared to baselines.
arXiv Detail & Related papers (2025-10-07T15:16:47Z)
From Harm to Help: Turning Reasoning In-Context Demos into Assets for Reasoning LMs [58.02809208460186]
We revisit this paradox using high-quality traces from DeepSeek-R1 as demonstrations.<n>We find that adding more exemplars consistently degrades accuracy, even when demonstrations are optimal.<n>We introduce Insight-to-solve (I2S), a sequential test-time procedure that turns demonstrations into explicit, reusable insights.
arXiv Detail & Related papers (2025-09-27T08:59:31Z)
When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs [55.20230501807337]
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework.<n>We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts.<n> Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.<n>We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.<n>Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering [30.12049172634714]
We introduce OKGQA, a benchmark specifically designed to evaluate Large Language Models augmented with Knowledge Graphs.<n> OKGQA reflects practical complexities through diverse question types and incorporates metrics to quantify both hallucination rates and reasoning improvements.<n>We propose a benchmark variant, OKGQA-P, to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated.
arXiv Detail & Related papers (2024-10-10T16:29:21Z)
Prompting or Fine-tuning? Exploring Large Language Models for Causal Graph Validation [0.0]
This study explores the capability of Large Language Models to evaluate causality in causal graphs.<n>Our study compares two approaches: (1) prompting-based method for zero-shot and few-shot causal inference and, (2) fine-tuning language models for the causal relation prediction task.
arXiv Detail & Related papers (2024-05-29T09:06:18Z)
Quantifying Semantic Emergence in Language Models [31.608080868988825]
Large language models (LLMs) are widely recognized for their exceptional capacity to capture semantics meaning.<n>In this work, we introduce a quantitative metric, Information Emergence (IE), designed to measure LLMs' ability to extract semantics from input tokens.
arXiv Detail & Related papers (2024-05-21T09:12:20Z)
Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4 [10.01547158445743]
We evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Efficient Fine-Tuning (PEFT) We found that the two PEFT adapters improves the F1 score (+0.0346) and consistency (+0.152) of the LLMs. Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328.
arXiv Detail & Related papers (2024-03-30T22:27:21Z)
Large Language Models in Medical Term Classification and Unexpected Misalignment Between Response and Reasoning [28.355000184014084]
This study assesses the ability of state-of-the-art large language models (LLMs) to identify patients with mild cognitive impairment (MCI) from discharge summaries. The data was partitioned into training, validation, and testing sets in a 7:2:1 ratio for model fine-tuning and evaluation. Open-source models like Falcon and LLaMA 2 achieved high accuracy but lacked explanatory reasoning.
arXiv Detail & Related papers (2023-12-19T17:36:48Z)
"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset.
arXiv Detail & Related papers (2023-12-18T17:18:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.