Subword models struggle with word learning, but surprisal hides it
- URL: http://arxiv.org/abs/2502.12835v1
- Date: Tue, 18 Feb 2025 13:09:16 GMT
- Title: Subword models struggle with word learning, but surprisal hides it
- Authors: Bastian Bunzeck, Sina Zarrieß
- Abstract summary: We study word learning in subword and character language models with the psycholinguistic lexical decision task.
While subword LMs struggle to discern words and non-words with high accuracy, character LMs solve this task easily and consistently.
- Score: 8.883534683127415
- License:
- Abstract: We study word learning in subword and character language models with the psycholinguistic lexical decision task. While subword LMs struggle to discern words and non-words with high accuracy, character LMs solve this task easily and consistently. Furthermore, when comparing word learning and syntactic learning, the two processes are separable in character LMs, where word learning predates syntactic learning, whereas they occur simultaneously in subword LMs. This raises questions about the adequacy of subword LMs for modeling language acquisition and positions character LMs as a viable alternative.
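As a rough illustration of the lexical decision setup, the sketch below scores a word and a matched non-word with a causal LM and picks whichever string receives the higher total log-probability (i.e., the lower total surprisal). It assumes a Hugging Face causal LM, with `gpt2` as a placeholder; the paper's actual models, stimuli, and scoring protocol are not reproduced here.

```python
# Minimal lexical-decision sketch. Assumes a Hugging Face causal LM; "gpt2" is
# only a placeholder, and the paper's models, stimuli, and scoring may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def string_logprob(text: str) -> float:
    """Total log-probability of a string under the LM (negative total surprisal)."""
    # Prepend BOS so even single-token strings have a conditioning context.
    ids = tokenizer(tokenizer.bos_token + text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for positions 1..T-1
    token_logprobs = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.sum().item()

def lexical_decision(word: str, nonword: str) -> str:
    """Pick whichever string the LM assigns the higher probability."""
    return word if string_logprob(word) > string_logprob(nonword) else nonword

print(lexical_decision("house", "hounse"))  # a model that has learned the word should pick "house"
```

A character LM would be scored the same way, with subword segmentation replaced by character-level segmentation.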
Related papers
- From Tokens to Words: On the Inner Lexicon of LLMs [7.148628740938674]
Natural language is composed of words, but modern LLMs process sub-words as input.
We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations.
Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
arXiv Detail & Related papers (2024-10-08T09:53:35Z)
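The word/sub-word mismatch discussed in the entry above is easy to see by tokenizing a few words. This is a generic illustration with the `gpt2` tokenizer as a stand-in, not the detokenization probe used in that paper.

```python
# Generic illustration of subword segmentation; gpt2's tokenizer is a stand-in,
# not the detokenization probe used in the paper above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["house", "lighthouse", "unhousable"]:
    print(f"{word!r} -> {tokenizer.tokenize(word)}")  # rarer words split into several subword pieces
```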
- PhonologyBench: Evaluating Phonological Skills of Large Language Models [57.80997670335227]
Phonology, the study of speech's structure and pronunciation rules, is a critical yet often overlooked component in Large Language Model (LLM) research.
We present PhonologyBench, a novel benchmark consisting of three diagnostic tasks designed to explicitly test the phonological skills of LLMs.
We observe a significant gap of 17% and 45% on Rhyme Word Generation and Syllable counting, respectively, when compared to humans.
arXiv Detail & Related papers (2024-04-03T04:53:14Z)
- Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
- Automating Knowledge Acquisition for Content-Centric Cognitive Agents Using LLMs [0.0]
The paper describes a system that uses large language model (LLM) technology to support the automatic learning of new entries in an intelligent agent's semantic lexicon.
The process is bootstrapped by an existing non-toy lexicon and a natural language generator that converts formal, ontologically-grounded representations of meaning into natural language sentences.
arXiv Detail & Related papers (2023-12-27T02:31:51Z)
- Word Embeddings Are Steers for Language Models [57.83026781380927]
We name such steers LM-Steers and find that they exist in LMs of all sizes.
On tasks such as language model detoxification and sentiment control, LM-Steers can achieve comparable or superior performance.
An LM-Steer is transferable between different language models by an explicit-form calculation.
arXiv Detail & Related papers (2023-05-22T07:52:04Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam [2.7823528791601695]
A pronunciation lexicon (PL) and a language model (LM) are essential to correctly retrieve spoken word sequences.
Because Malayalam is a morphologically complex language, its vocabulary is so large that it is impossible to build a PL and an LM covering all of its diverse word forms.
Building the PL and LM from subword tokens, and combining those tokens back into words after decoding, enables the recovery of many out-of-vocabulary words (a toy recombination sketch follows this entry).
arXiv Detail & Related papers (2023-01-17T07:29:47Z)
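The post-decoding recombination step mentioned above can be sketched generically: subword tokens carrying a continuation marker are merged back into surface words. The `+` marker convention here is a hypothetical example, not necessarily the scheme used in that paper.

```python
# Hypothetical recombination after decoding: a trailing "+" is assumed to mark a
# subword that continues into the next token; the paper's marker scheme may differ.
def join_subwords(tokens: list[str]) -> str:
    words, current = [], ""
    for tok in tokens:
        if tok.endswith("+"):       # continuation piece: keep building the word
            current += tok[:-1]
        else:                       # word-final piece: close the word
            words.append(current + tok)
            current = ""
    if current:                     # leftover unfinished piece, if any
        words.append(current)
    return " ".join(words)

print(join_subwords(["ma+", "la+", "yalam", "speech"]))  # -> "malayalam speech"
```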
- Extensible Prompts for Language Models on Zero-shot Language Style Customization [89.1622516945109]
X-Prompt instructs a large language model (LLM) with prompts that extend beyond natural language (NL).
Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words (a generic token-registration sketch follows this entry).
These imaginary words are designed to be out-of-distribution robust so that they can be (re)used like NL words in various prompts.
arXiv Detail & Related papers (2022-12-01T16:11:56Z)
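One generic way to register a new imaginary word is to add it to the tokenizer and grow the model's embedding matrix, then train only that new embedding. The snippet below shows the standard Hugging Face operations as a hedged sketch; it is not the X-Prompt training recipe itself, and `<imaginary-style-1>` is a made-up token name.

```python
# Generic sketch of registering an imaginary word as a new token; these are
# standard transformers calls, not the X-Prompt method itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_tokens(["<imaginary-style-1>"])  # hypothetical imaginary word
model.resize_token_embeddings(len(tokenizer))              # new embedding row, randomly initialized

print(num_added, tokenizer.tokenize("Write it in <imaginary-style-1> style."))
# In a prompt-tuning-style setup, only the new embedding would be trained while
# the rest of the model stays frozen (one possible choice, not the paper's exact recipe).
```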
- Always Keep your Target in Mind: Studying Semantics and Improving Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both older and the most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z)
- Subword Mapping and Anchoring across Languages [1.9352552677009318]
Subword Mapping and Anchoring across Languages (SMALA) is a method to construct bilingual subword vocabularies.
SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique.
We show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
arXiv Detail & Related papers (2021-09-09T20:46:27Z)
- Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction [12.376752724719005]
Language-independent tokenisation (LIT) methods do not require labelled language resources or lexicons.
Language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources.
We empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages.
arXiv Detail & Related papers (2020-02-25T16:24:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.