Related papers: Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar

URL: http://arxiv.org/abs/2505.19599v1
Date: Mon, 26 May 2025 07:08:47 GMT
Title: Inconsistent Tokenizations Cause Language Models to be Perplexed by Japanese Grammar
Authors: Andrew Gambardella, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract summary: We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. We show in further experiments that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
Score: 27.3347020320559
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Typical methods for evaluating the performance of language models evaluate their ability to answer questions accurately. These evaluation metrics are acceptable for determining the extent to which language models can understand and reason about text in a general sense, but fail to capture nuanced capabilities, such as the ability of language models to recognize and obey rare grammar points, particularly in languages other than English. We measure the perplexity of language models when confronted with the "first person psych predicate restriction" grammar point in Japanese. Weblab is the only tested open source model in the 7-10B parameter range which consistently assigns higher perplexity to ungrammatical psych predicate sentences than grammatical ones. We give evidence that Weblab's uniformly bad tokenization is a possible root cause for its good performance, and show that Llama 3's perplexity on grammatical psych predicate sentences can be reduced by orders of magnitude (28x difference) by restricting test sentences to those with uniformly well-behaved tokenizations. We show in further experiments on machine translation tasks that language models will use alternative grammar patterns in order to produce grammatical sentences when tokenization issues prevent the most natural sentence from being output.
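
As a rough illustration of the comparison the abstract describes, the sketch below computes perplexity from per-token log-probabilities and checks whether a grammatical sentence scores lower than an ungrammatical one. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `perplexity` and the log-probability values are hypothetical placeholders, and in practice the log-probabilities would come from a language model scoring each token of a sentence.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical placeholder log-probabilities (NOT results from the paper).
# Each list stands for the per-token scores a model assigned to one sentence.
grammatical_logprobs = [-1.0, -2.0, -1.5]
ungrammatical_logprobs = [-1.0, -2.0, -4.5]

ppl_grammatical = perplexity(grammatical_logprobs)
ppl_ungrammatical = perplexity(ungrammatical_logprobs)

# A model "obeys" the grammar point on this pair if the ungrammatical
# sentence receives the higher perplexity.
prefers_grammatical = ppl_grammatical < ppl_ungrammatical
print(prefers_grammatical)
```

Because perplexity is averaged per token, the same sentence can receive a very different score depending on how many tokens it is split into, which is why tokenization consistency matters for this kind of comparison.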

Related papers

Large Language Model probabilities cannot distinguish between possible and impossible language [0.11726720776908521]
We use model-internal representations to tap directly into the way Large Language Models represent the 'grammatical-ungrammatical' distinction. We predict that if string-probabilities can function as proxies for the limits of grammar, the ungrammatical condition will stand out among the conditions that involve linguistic violations. Our results do not reveal a unique surprisal signature for ungrammatical prompts, as the semantically and pragmatically odd conditions consistently show higher surprisal.
arXiv Detail & Related papers (2025-09-18T16:17:48Z)
Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts [9.971070147103536]
Punctuation plays a vital role in structuring meaning, yet current models often struggle to restore it accurately in transcripts of spontaneous speech. We introduce Cadence, a generalist punctuation restoration model adapted from a pretrained large language model. It surpasses the previous state of the art in performance while expanding support from 14 to all 22 Indian languages and English.
arXiv Detail & Related papers (2025-06-04T09:54:38Z)
Negation: A Pink Elephant in the Large Language Models' Room? [2.8078480738404]
Negations are key to determining sentence meaning, making them essential for logical reasoning. We investigate how model size and language impact its ability to handle negation correctly by evaluating popular language models. Our datasets can facilitate further research and improvements of language model reasoning in multilingual settings.
arXiv Detail & Related papers (2025-03-28T13:04:41Z)
Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics. Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding. We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
You should evaluate your language model on marginal likelihood over tokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations. We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z)
Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language. We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences. We reduce the mean top1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z)
Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task. We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z)
On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data. Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
Recurrent Neural Network Language Models Always Learn English-Like Relative Clause Attachment [17.995905582226463]
We compare model performance in English and Spanish to show that non-linguistic biases in RNN LMs advantageously overlap with syntactic structure in English but not Spanish. English models may appear to acquire human-like syntactic preferences, while models trained on Spanish fail to acquire comparable human-like preferences.
arXiv Detail & Related papers (2020-05-01T01:21:47Z)
Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages. We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.