Predicting Sentence Acceptability Judgments in Multimodal Contexts
- URL: http://arxiv.org/abs/2602.20918v1
- Date: Tue, 24 Feb 2026 13:54:38 GMT
- Title: Predicting Sentence Acceptability Judgments in Multimodal Contexts
- Authors: Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin
- Abstract summary: Previous work has examined the capacity of deep neural networks (DNNs) to predict human sentence acceptability judgments. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings.
- Score: 22.053970196200925
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns, and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a larger gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.
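The abstract does not spell out how the normalised log probabilities are computed; a common choice in sentence-acceptability work is the mean per-token log probability under a causal language model (sometimes further adjusted for word frequency, as in SLOR). The following is a minimal sketch of that baseline measure, assuming the Hugging Face `transformers` library and GPT-2 as a stand-in for the models used in the paper:

```python
# Minimal sketch: mean per-token log probability of a sentence under a causal LM.
# Assumptions: GPT-2 stands in for the paper's models; the paper's exact
# normalisation scheme may differ from this simple per-token average.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_log_prob(sentence: str) -> float:
    """Average log probability per predicted token of `sentence`."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean cross-entropy
        # over predicted tokens, i.e. the negative mean log probability.
        loss = model(ids, labels=ids).loss
    return -loss.item()

print(mean_log_prob("The cat sat on the mat."))   # less negative: more acceptable
print(mean_log_prob("The cat sat mat on the."))   # more negative: less acceptable
```

Scores of this kind would then be compared against the acceptability ratings the LLMs generate, which is the correlation the abstract describes.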
Related papers
- Positional Biases Shift as Inputs Approach Context Window Limits [57.00239097102958]
The lost-in-the-middle (LiM) effect is strongest when inputs occupy up to 50% of a model's context window. We observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input.
arXiv Detail & Related papers (2025-08-10T20:40:24Z) - LLM Agents Display Human Biases but Exhibit Distinct Learning Patterns [0.0]
We investigate the choice patterns of Large Language Models (LLMs) in the context of Decisions from Experience tasks. We find that on the aggregate, LLMs appear to display behavioral biases similar to humans. However, more nuanced analyses of the choice patterns reveal that this happens for very different reasons.
arXiv Detail & Related papers (2025-03-13T10:47:03Z) - Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z) - Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities [0.0]
We compare data generated with mono- and multilingual LLMs spanning a range of model sizes with data provided by human participants. We aim to develop a benchmark to assess the capabilities of LLMs with discourse biases as a robust proxy for more general discourse understanding capabilities.
arXiv Detail & Related papers (2025-01-22T16:07:24Z) - Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs). In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt. Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLMs).
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z) - Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content.
We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z) - Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z) - Multimodality and Attention Increase Alignment in Natural Language Prediction Between Humans and Computational Models [0.8139163264824348]
Humans are known to use salient multimodal features, such as visual cues, to facilitate the processing of upcoming words.
Multimodal computational models can integrate visual and linguistic data using a visual attention mechanism to assign next-word probabilities.
We show that predictability estimates from humans aligned more closely with scores generated from multimodal models vs. their unimodal counterparts.
arXiv Detail & Related papers (2023-08-11T09:30:07Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Can Large Language Models Capture Dissenting Human Voices? [7.668954669688971]
Large language models (LLMs) have shown impressive achievements in solving a broad range of tasks.
We evaluate the performance and alignment of LLM distribution with humans using two different techniques.
We show LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture human disagreement distribution.
arXiv Detail & Related papers (2023-05-23T07:55:34Z) - Human Behavioral Benchmarking: Numeric Magnitude Comparison Effects in Large Language Models [4.412336603162406]
Large Language Models (LLMs) do not differentially represent numbers, which are pervasive in text.
In this work, we investigate how well popular LLMs capture the magnitudes of numbers from a behavioral lens.
arXiv Detail & Related papers (2023-05-18T07:50:44Z) - Attention-likelihood relationship in transformers [2.8304391396200064]
We analyze how large language models (LLMs) represent out-of-context words, investigating their reliance on the given context to capture their semantics.
Our likelihood-guided text perturbations reveal a correlation between token likelihood and attention values in transformer-based language models.
arXiv Detail & Related papers (2023-03-15T00:23:49Z)
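The last entry reports a correlation between token likelihood and attention values in transformer language models. Its exact procedure is not described here; below is a purely illustrative sketch of one way such a correlation could be computed, assuming GPT-2 via Hugging Face `transformers` and using the attention each token receives in the final layer as a crude proxy:

```python
# Illustrative sketch (not the cited paper's exact procedure): correlate each
# token's log probability with the attention mass it receives in the final layer.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The committee postponed the vote until further evidence was available."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_attentions=True)

# Log probability the model assigns to each token given its left context
# (position 0 has no prediction, so alignment starts from position 1).
log_probs = torch.log_softmax(out.logits[0, :-1], dim=-1)
token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

# Attention each of those tokens receives: average the final layer's weights
# over heads and over querying positions (a crude summary of "attention received").
last_layer = out.attentions[-1][0]                    # (heads, seq, seq)
attention_received = last_layer.mean(dim=0).mean(dim=0)[1:]

rho, p = spearmanr(token_log_probs.numpy(), attention_received.numpy())
print(f"Spearman rho = {rho:.3f} (p = {p:.3g})")
```

Averaging over heads and querying positions is only one of many ways to summarise attention; the cited work uses likelihood-guided text perturbations, which this sketch does not attempt to reproduce.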