On the Proper Treatment of Tokenization in Psycholinguistics
- URL: http://arxiv.org/abs/2410.02691v2
- Date: Thu, 31 Oct 2024 12:40:33 GMT
- Title: On the Proper Treatment of Tokenization in Psycholinguistics
- Authors: Mario Giulianelli, Luca Malagutti, Juan Luis Gastaldi, Brian DuSell, Tim Vieira, Ryan Cotterell
- Abstract summary: The paper argues that token-level language models should be marginalized into character-level language models before they are used in psycholinguistic studies. Empirically, the authors find various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
- Score: 53.960910019072436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models are widely used in computational psycholinguistics to test theories that relate the negative log probability (the surprisal) of a region of interest (a substring of characters) under a language model to its cognitive cost experienced by readers, as operationalized, for example, by gaze duration on the region. However, the application of modern language models to psycholinguistic studies is complicated by the practice of using tokenization as an intermediate step in training a model. Doing so results in a language model over token strings rather than one over character strings. Vexingly, regions of interest are generally misaligned with these token strings. The paper argues that token-level language models should be (approximately) marginalized into character-level language models before they are used in psycholinguistic studies to compute the surprisal of a region of interest; then, the marginalized character-level language model can be used to compute the surprisal of an arbitrary character substring, which we term a focal area, that the experimenter may wish to use as a predictor. Our proposal of marginalizing a token-level model into a character-level one solves this misalignment issue independently of the tokenization scheme. Empirically, we discover various focal areas whose surprisal is a better psychometric predictor than the surprisal of the region of interest itself.
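The marginalization idea in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical toy token-level model in which tokens are drawn independently (a unigram model with an end-of-string symbol); this is not the neural setting or the approximate procedure of the paper, and the names `TOKEN_PROBS`, `EOS_PROB`, `prefix_prob`, and `surprisal` are illustrative only. The character-level prefix probability sums over every token sequence whose spelling begins with the given characters, so the surprisal of a focal area does not depend on where token boundaries fall.

```python
import math

# Hypothetical toy token-level model (an assumption, not the paper's model):
# each token is drawn independently, and EOS ends the string.
TOKEN_PROBS = {"a": 0.3, "b": 0.2, "ab": 0.3}
EOS_PROB = 0.2  # together with TOKEN_PROBS, the probabilities sum to 1


def exact_tokenization_mass(chars):
    """alpha[i] = total probability of all token sequences spelling chars[:i]."""
    alpha = [0.0] * (len(chars) + 1)
    alpha[0] = 1.0
    for i in range(len(chars)):
        if alpha[i] == 0.0:
            continue
        for tok, p in TOKEN_PROBS.items():
            # Extend only with tokens that match the characters at position i.
            if chars.startswith(tok, i):
                alpha[i + len(tok)] += alpha[i] * p
    return alpha


def prefix_prob(chars):
    """Probability that the generated character string starts with `chars`."""
    if not chars:
        return 1.0
    alpha = exact_tokenization_mass(chars)
    total = 0.0
    for i in range(len(chars)):
        rest = chars[i:]
        # The last token must cover the remaining characters; it may run past them.
        cover = sum(p for tok, p in TOKEN_PROBS.items() if tok.startswith(rest))
        total += alpha[i] * cover
    return total


def surprisal(focal, context=""):
    """Surprisal in bits of the character string `focal` given `context`."""
    return -math.log2(prefix_prob(context + focal) / prefix_prob(context))


if __name__ == "__main__":
    # "ab" can be spelled as the tokens ["a", "b"] or as the single token ["ab"].
    print(prefix_prob("ab"))
    print(surprisal("b", context="a"))
```

With a contextual neural model the sum over tokenizations cannot be enumerated this way in general, which is presumably why the abstract speaks of approximate marginalization.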
Related papers
- Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z) - Learning an Artificial Language for Knowledge-Sharing in Multilingual Translation [15.32063273544696]
We discretize the latent space of multilingual models by assigning encoder states to entries in a codebook.
We validate our approach on large-scale experiments with realistic data volumes and domains.
We also use the learned artificial language to analyze model behavior, and discover that using a similar bridge language increases knowledge-sharing among the remaining languages.
arXiv Detail & Related papers (2022-11-02T17:14:42Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - Naturalistic Causal Probing for Morpho-Syntax [76.83735391276547]
We suggest a naturalistic strategy for input-level intervention on real world data in Spanish.
Using our approach, we isolate morpho-syntactic features from confounders in sentences.
We apply this methodology to analyze causal effects of gender and number on contextualized representations extracted from pre-trained models.
arXiv Detail & Related papers (2022-05-14T11:47:58Z) - You should evaluate your language model on marginal likelihood over tokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Language Model Evaluation Beyond Perplexity [47.268323020210175]
We analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained.
We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions.
arXiv Detail & Related papers (2021-05-31T20:13:44Z) - Linguistically inspired morphological inflection with a sequence to sequence model [19.892441884896893]
Our research question is whether a neural network would be capable of learning inflectional morphemes for inflection production.
We are using an inflectional corpus and a single layer seq2seq model to test this hypothesis.
Our character-morpheme-based model creates inflection by predicting the stem character-to-character and the inflectional affixes as character blocks.
arXiv Detail & Related papers (2020-09-04T08:58:42Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)