Related papers: On the Proper Treatment of Tokenization in Psycholinguistics

URL: http://arxiv.org/abs/2410.02691v2
Date: Thu, 31 Oct 2024 12:40:33 GMT
Title: On the Proper Treatment of Tokenization in Psycholinguistics
Authors: Mario Giulianelli, Luca Malagutti, Juan Luis Gastaldi, Brian DuSell, Tim Vieira, Ryan Cotterell
Abstract summary: The paper argues that token-level language models should be marginalized into character-level language models before they are used in psycholinguistic studies. We find various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
Score: 53.960910019072436
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; then, the marginalized character-level language model can be used to compute the surprisal of an arbitrary character substring, which we term a focal area, that the experimenter may wish to use as a predictor. Our proposal of marginalizing a token-level model into a character-level one solves this misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
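The marginalization idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes a hypothetical context-free (unigram) token model with made-up probabilities, where the sum over all tokenizations of a character string can be computed exactly by dynamic programming. Real token-level language models condition on context, so this exact recursion does not carry over directly, which is presumably why the abstract speaks of approximate marginalization.

```python
import math

# Hypothetical toy token-level unigram model (an assumption for illustration):
# each token is drawn independently, and EOS ends the string.
# The probabilities below are invented and sum to 1 together with EOS_PROB.
TOKEN_PROBS = {"a": 0.3, "b": 0.2, "ab": 0.3}
EOS_PROB = 0.2


def marginal_string_prob(chars: str) -> float:
    """Sum the probability of every token sequence that spells `chars`."""
    # alpha[i] holds the total probability of all tokenizations of chars[:i].
    alpha = [0.0] * (len(chars) + 1)
    alpha[0] = 1.0
    for end in range(1, len(chars) + 1):
        for token, prob in TOKEN_PROBS.items():
            start = end - len(token)
            if start >= 0 and chars[start:end] == token:
                alpha[end] += alpha[start] * prob
    # Every complete tokenization is followed by EOS.
    return alpha[-1] * EOS_PROB


def surprisal_bits(prob: float) -> float:
    """Surprisal is the negative log probability, here in bits."""
    return -math.log2(prob)


# Probability of "ab" under the single tokenization ["ab"] only.
single_tokenization = TOKEN_PROBS["ab"] * EOS_PROB
# Probability of "ab" summed over ["ab"] and ["a", "b"].
marginal = marginal_string_prob("ab")

print(surprisal_bits(single_tokenization), surprisal_bits(marginal))
```

In this toy setting the marginal probability of a character string is never smaller than the probability of any single tokenization, so its surprisal is never larger.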

Related papers

Information Locality as an Inductive Bias for Neural Language Models [52.92279412466086]
We show that languages with higher $m$-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs are highly sensitive to the statistical structure of a language.
arXiv Detail & Related papers (2025-06-05T15:21:05Z)
Ask a Local: Detecting Hallucinations With Specialized Model Divergence [0.16874375111244325]
We introduce "Ask a Local", a novel hallucination detection method for large language models. Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans. Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages.
arXiv Detail & Related papers (2025-06-03T20:00:49Z)
The Impact of Token Granularity on the Predictive Power of Language Model Surprisal [15.073507986272027]
One factor that has been overlooked in cognitive modeling is the granularity of subword tokens. Experiments with naturalistic reading times reveal a substantial influence of token granularity on surprisal. On garden-path constructions, language models trained on coarser-grained tokens generally assigned higher surprisal to critical regions.
arXiv Detail & Related papers (2024-12-16T16:24:58Z)
Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z)
Learning an Artificial Language for Knowledge-Sharing in Multilingual Translation [15.32063273544696]
We discretize the latent space of multilingual models by assigning encoder states to entries in a codebook. We validate our approach on large-scale experiments with realistic data volumes and domains. We also use the learned artificial language to analyze model behavior, and discover that using a similar bridge language increases knowledge-sharing among the remaining languages.
arXiv Detail & Related papers (2022-11-02T17:14:42Z)
Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real-world data in Spanish. Using our approach, we isolate morpho-syntactic features from confounders in sentences. We apply this methodology to analyze causal effects of gender and number on contextualized representations extracted from pre-trained models.
arXiv Detail & Related papers (2022-05-14T11:47:58Z)
You should evaluate your language model on marginal likelihood over tokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations. We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z)
Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages. We infer this distribution from a sample of typologically diverse training languages. We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
Language Model Evaluation Beyond Perplexity [47.268323020210175]
We analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions.
arXiv Detail & Related papers (2021-05-31T20:13:44Z)
Linguistically inspired morphological inflection with a sequence to sequence model [19.892441884896893]
Our research question is whether a neural network would be capable of learning inflectional morphemes for inflection production. We are using an inflectional corpus and a single layer seq2seq model to test this hypothesis. Our character-morpheme-based model creates inflection by predicting the stem character-to-character and the inflectional affixes as character blocks.
arXiv Detail & Related papers (2020-09-04T08:58:42Z)
Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.