The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
- URL: http://arxiv.org/abs/2602.23928v1
- Date: Fri, 27 Feb 2026 11:23:45 GMT
- Title: The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
- Authors: Gary Lupyan, Senyi Yang
- Abstract summary: Large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. We show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., "At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp", can be translated to conventional English that is, in many cases, close to the original text, e.g., "At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife." These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of "Jabberwockified" English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing, whether in biological or artificial systems, likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
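The "Jabberwockification" the abstract describes can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' actual procedure: the closed-class word list, syllable inventory, and suffix handling are all assumptions made for the sketch. Content words are swapped for pronounceable nonsense stems while function words, punctuation, and inflectional endings survive intact.

```python
import random
import re

# Illustrative (not exhaustive) list of closed-class words to preserve.
CLOSED_CLASS = {
    "a", "an", "the", "of", "to", "in", "on", "at", "we", "are", "is",
    "who", "with", "his", "her", "and", "or", "into", "that", "this",
}
# Inflectional endings kept visible on the nonsense stem.
SUFFIXES = ("ing", "ed", "es", "s")

# Toy syllable inventory for building pronounceable nonsense stems.
ONSETS = ["spr", "gh", "cr", "fl", "wr", "thr", "sw", "pl"]
VOWELS = ["a", "e", "i", "o", "u", "ou", "ai", "ea"]
CODAS = ["rt", "nt", "rp", "tch", "le", "nd", "rge", "mp"]

def nonsense_stem(rng: random.Random) -> str:
    """Build a pronounceable nonsense stem, e.g. 'ghitch' or 'sprount'."""
    return rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS)

def jabberwockify(text: str, seed: int = 0) -> str:
    """Replace content words with nonsense, keeping structural cues."""
    rng = random.Random(seed)
    out = []
    # Alternate word tokens and non-word separators so spacing survives.
    for token in re.findall(r"\w+|\W+", text):
        low = token.lower()
        if not token.strip() or not token.isalpha() or low in CLOSED_CLASS:
            out.append(token)  # keep whitespace, punctuation, function words
            continue
        suffix = next(
            (s for s in SUFFIXES if low.endswith(s) and len(low) > len(s) + 2),
            "",
        )
        word = nonsense_stem(rng) + suffix
        out.append(word.capitalize() if token[0].isupper() else word)
    return "".join(out)

print(jabberwockify(
    "At the start of the story, we meet a man who moves into an "
    "apartment building with his wife."
))
```

With the same seed the mapping is deterministic, so a given content word position always receives the same nonsense string within a run, while function words and morphology are left untouched, which is exactly the signal the paper argues LLMs exploit.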
Related papers
- The unreasonable effectiveness of pattern matching [1.0780189313017459]
Large language models can make sense of "Jabberwocky" language in which most or all content words have been randomly replaced by nonsense strings. The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching.
arXiv Detail & Related papers (2026-01-16T16:53:08Z) - False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models [53.01170039144264]
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? We find that models with overlap outperform models with disjoint vocabularies.
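The vocabulary overlap that this paper studies can be made concrete with a small sketch. The token sets below are toy stand-ins for real subword vocabularies (an actual study would read them from trained tokenizers); the Jaccard measure is one natural choice of overlap statistic, not necessarily the paper's.

```python
# Toy subword vocabularies standing in for trained English/German tokenizers.
vocab_en = {"the", "ing", "tion", "un", "er", "ly", "de"}
vocab_de = {"die", "ung", "tion", "un", "er", "lich", "de"}

# Shared tokens are candidates for cross-lingual transfer (or interference).
shared = vocab_en & vocab_de
jaccard = len(shared) / len(vocab_en | vocab_de)

print(sorted(shared))          # tokens present in both vocabularies
print(round(jaccard, 3))       # overlap as a fraction of the union
```

In a real experiment the same set operations would be applied to the full vocabularies of two monolingual tokenizers to quantify how much of the multilingual vocabulary is genuinely shared.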
arXiv Detail & Related papers (2025-09-23T07:47:54Z) - On the Semantics of Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated the potential to replicate human language abilities through technology. It remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level.
arXiv Detail & Related papers (2025-07-07T20:02:57Z) - Infusing Prompts with Syntax and Semantics [0.0]
We analyze the effect of directly infusing various kinds of syntactic and semantic information into large language models. We show that linguistic analysis can significantly boost language models, to the point that we have surpassed previous best systems.
arXiv Detail & Related papers (2024-12-08T23:49:38Z) - Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into
the Morphological Capabilities of a Large Language Model [23.60677380868016]
Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills.
Here, we conduct the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages.
We find that ChatGPT massively underperforms purpose-built systems, particularly in English.
arXiv Detail & Related papers (2023-10-23T17:21:03Z) - A blind spot for large language models: Supradiegetic linguistic information [0.602276990341246]
Large Language Models (LLMs) like ChatGPT achieve a linguistic fluency that is impressively, even shockingly, human-like.
We examine the details of this framing using ideas from several areas, including linguistics, embodied cognition, cognitive science, mathematics, and history.
We use these concepts to investigate why LLMs like ChatGPT have trouble handling palindromes, the visual characteristics of symbols, translating Sumerian cuneiform, and continuing integer sequences.
arXiv Detail & Related papers (2023-06-11T22:15:01Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z) - Cross-Lingual Ability of Multilingual Masked Language Models: A Study of
Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while the composition is more crucial to the success of cross-linguistic transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z) - Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.