Related papers: Cat, Rat, Meow: On the Alignment of Language Model and Human Term-Similarity Judgments

Cat, Rat, Meow: On the Alignment of Language Model and Human Term-Similarity Judgments

URL: http://arxiv.org/abs/2504.07965v1
Date: Thu, 10 Apr 2025 17:59:57 GMT
Title: Cat, Rat, Meow: On the Alignment of Language Model and Human Term-Similarity Judgments
Authors: Lorenz Linhardt, Tom Neuhäuser, Lenka Tětková, Oliver Eberle,
Abstract summary: We evaluate 32 publicly available language models for their representational and behavioral alignment with human similarity judgments.<n>We find that even the representations of small language models can achieve human-level alignment.
Score: 4.856605865350511
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Small and mid-sized generative language models have gained increasing attention. Their size and availability make them amenable to being analyzed at a behavioral as well as a representational level, allowing investigations of how these levels interact. We evaluate 32 publicly available language models for their representational and behavioral alignment with human similarity judgments on a word triplet task. This provides a novel evaluation setting to probe semantic associations in language beyond common pairwise comparisons. We find that (1) even the representations of small language models can achieve human-level alignment, (2) instruction-tuned model variants can exhibit substantially increased agreement, (3) the pattern of alignment across layers is highly model dependent, and (4) alignment based on models' behavioral responses is highly dependent on model size, matching their representational alignment only for the largest evaluated models.

Related papers

Do language models accommodate their users? A study of linguistic convergence [15.958711524171362]
We find that models strongly converge to the conversation's style, often significantly overfitting relative to the human baseline.<n>We observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained counterparts.
arXiv Detail & Related papers (2025-08-05T09:55:40Z)
DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark for evaluating vision-language models on tasks and behavioral data.<n>We show that DevBench provides a benchmark for comparing models to human language development.<n>These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z)
A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors [50.046717886067555]
We show that when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood. We provide a formal treatment of this phenomenon and demonstrate how a choice of sampling adaptor allows for a selection of how much likelihood we exchange for the reward.
arXiv Detail & Related papers (2024-06-14T17:38:21Z)
Split and Rephrase with Large Language Models [2.499907423888049]
Split and Rephrase (SPRP) task consists in splitting complex sentences into a sequence of shorter grammatical sentences. We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions. This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning. We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
Can language models handle recursively nested grammatical structures? A case study on comparing models and humans [0.0]
How should we compare the capabilities of language models and humans? I consider a case study: processing of nested grammatical structures. I suggest that there is an important difference between evaluating cognitive models of a specific phenomenon and evaluating broadly-trained models.
arXiv Detail & Related papers (2022-10-27T10:25:12Z)
Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text. We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality. We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
arXiv Detail & Related papers (2022-10-16T04:35:58Z)
A General Language Assistant as a Laboratory for Alignment [3.3598752405752106]
We study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. We study a preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.
arXiv Detail & Related papers (2021-12-01T22:24:34Z)
Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data. Based on experimental results, neural semantics that leverage GAP MODEL obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-generative benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English [17.993417004424078]
Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT) testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks.
arXiv Detail & Related papers (2020-11-02T13:25:39Z)
Overestimation of Syntactic Representationin Neural Language Models [16.765097098482286]
One popular method for determining a model's ability to induce syntactic structure trains a model on strings generated according to a template then tests the model's ability to distinguish such strings from superficially similar ones with different syntax. We illustrate a fundamental problem with this approach by reproducing positive results from a recent paper with two non-syntactic baseline language models.
arXiv Detail & Related papers (2020-04-10T15:13:03Z)
Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.