What do tokens know about their characters and how do they know it?
- URL: http://arxiv.org/abs/2206.02608v1
- Date: Mon, 6 Jun 2022 13:27:26 GMT
- Title: What do tokens know about their characters and how do they know it?
- Authors: Ayush Kaushal, Kyle Mahowald
- Abstract summary: We show that pre-trained language models that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information.
We show that these models robustly encode character-level information and, in general, larger models perform better at the task.
- Score: 3.8254443661593633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models (PLMs) that use subword tokenization schemes can
succeed at a variety of language tasks that require character-level
information, despite lacking explicit access to the character composition of
tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe),
we probe what word pieces encode about character-level information by training
classifiers to predict the presence or absence of a particular alphabetical
character in a token, based on its embedding (e.g., probing whether the model
embedding for "cat" encodes that it contains the character "a"). We find that
these models robustly encode character-level information and, in general,
larger models perform better at the task. We show that these results generalize
to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic).
Then, through a series of experiments and analyses, we investigate the
mechanisms through which PLMs acquire English-language character information
during training and argue that this knowledge is acquired through multiple
phenomena, including a systematic relationship between particular characters
and particular parts of speech, as well as natural variability in the
tokenization of related strings.
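The probing setup above is simple enough to sketch in a few lines. The snippet below trains a linear probe to predict whether a token's surface string contains the letter "a" from the token's embedding alone; the toy vocabulary, the random stand-in embedding matrix, and the choice of a logistic-regression probe are illustrative assumptions (the paper probes embeddings from models such as GPT-J, BERT, RoBERTa, and GloVe), so this is a minimal sketch of the method rather than the authors' code.

```python
# Minimal sketch of the character-presence probe (illustrative only).
# Real experiments would use token embeddings taken from a PLM such as
# GPT-J, BERT, RoBERTa, or GloVe; here a random matrix stands in for them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

vocab = ["cat", "dog", "running", "##ing", "apple", "tree",
         "house", "banana", "sky", "water"]           # toy token vocabulary
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 300))              # stand-in token embeddings

target_char = "a"
labels = np.array([int(target_char in tok) for tok in vocab])  # 1 if the token contains "a"

X_train, X_test, y_train, y_test = train_test_split(
    emb, labels, test_size=0.3, random_state=0, stratify=labels)

probe = LogisticRegression(max_iter=1000)             # linear probe over embeddings
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

With random vectors the probe hovers around chance; the paper's finding is that probes trained on real PLM embeddings perform well above chance, and generally better for larger models.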
Related papers
- Identifying and Analyzing Task-Encoding Tokens in Large Language Models [55.03191279766383]
In this paper, we identify and analyze task-encoding tokens on whose representations the task performance depends.
We show that template and stopword tokens are the most prone to be task-encoding.
Our work sheds light on how large language models (LLMs) learn to perform a task from demonstrations, deepens our understanding of the varied roles different types of tokens play in LLMs, and provides insights for avoiding instability from improperly utilizing task-encoding tokens.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- Toucan: Token-Aware Character Level Language Modeling [44.85590844938571]
Toucan is an augmentation to character-level models to make them "token-aware".
We show significant speed-ups in character generation without a loss in language modeling performance.
Our approach results in a greater proportion of longer sequences being tokenized as single items.
arXiv Detail & Related papers (2023-11-15T00:57:51Z)
- Learning Mutually Informed Representations for Characters and Subwords [26.189422354038978]
We introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities.
We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling.
arXiv Detail & Related papers (2023-11-14T02:09:10Z)
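As a rough illustration of the two-modality idea in the entanglement model above, the sketch below lets a character stream and a subword stream attend to each other with standard cross-attention. The layer sizes, the single-layer design, and all names here are assumptions made for illustration; they do not reproduce the authors' architecture.

```python
# Hypothetical sketch: characters and subwords treated as two streams that
# cross-attend to each other (loosely in the spirit of the entanglement
# model; not the authors' actual architecture).
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.char_to_sub = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sub_to_char = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, char_h, sub_h):
        # The subword stream queries the character stream, and vice versa.
        sub_out, _ = self.char_to_sub(query=sub_h, key=char_h, value=char_h)
        char_out, _ = self.sub_to_char(query=char_h, key=sub_h, value=sub_h)
        return char_h + char_out, sub_h + sub_out     # residual updates

# Toy usage: batch of 2, 40 character positions, 10 subword positions.
char_h = torch.randn(2, 40, 256)
sub_h = torch.randn(2, 10, 256)
char_h, sub_h = CrossModalLayer()(char_h, sub_h)
print(char_h.shape, sub_h.shape)
```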
- Understanding the Role of Input Token Characters in Language Models: How Does Information Loss Affect Performance? [45.53600782873268]
We study how information loss in input token characters affects the performance of pre-trained language models.
Surprisingly, we find that even when pre-training under extreme settings, i.e., using only one character of each token, performance retention on standard NLU benchmarks and probing tasks remains high.
For instance, a model pre-trained only on the single first character of each token retains approximately 90% and 77% of the full-token model's performance on SuperGLUE and GLUE tasks, respectively.
arXiv Detail & Related papers (2023-10-26T09:47:50Z)
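The extreme setting described above, keeping only one character per token, amounts to a simple corruption of the pre-training input. The sketch below shows one way to apply it; the whitespace split stands in for a real subword tokenizer and the function name is hypothetical, so treat this as an illustration of the setting rather than the authors' pipeline.

```python
# Illustrative sketch of the "first character only" information-loss setting.
# A naive whitespace split stands in for a real subword tokenizer; the study
# itself pre-trains a language model on the corrupted tokens.
def keep_first_character(tokens):
    """Replace every token by its first character (extreme information loss)."""
    return [tok[0] if tok else tok for tok in tokens]

sentence = "language models encode surprising amounts of character information"
tokens = sentence.split()                  # stand-in for subword tokenization
print(keep_first_character(tokens))        # ['l', 'm', 'e', 's', 'a', 'o', 'c', 'i']
```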
- Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens [22.55706811131828]
We probe the embedding layer of pretrained language models.
We show that models learn the internal character composition of whole word and subword tokens.
arXiv Detail & Related papers (2021-08-25T11:48:05Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information [29.633735942273997]
XRayEmb is a method for retrofitting existing token-based models with character-level information.
We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures.
arXiv Detail & Related papers (2021-08-01T08:09:26Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
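To give a concrete feel for the soft, gradient-based subword selection that the Charformer entry above describes, the sketch below scores candidate character blocks of several sizes and mixes them with a per-position softmax. It omits the block offsets, score calibration, and downsampling of the real GBST module, and all shapes and names are assumptions, so it is a simplified illustration of the idea rather than the published method.

```python
# Simplified sketch of soft subword-block mixing (in the spirit of GBST; the
# actual Charformer module differs in block enumeration, score calibration,
# and downsampling).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftBlockMixer(nn.Module):
    def __init__(self, dim=128, block_sizes=(1, 2, 3, 4)):
        super().__init__()
        self.block_sizes = block_sizes
        self.score = nn.Linear(dim, 1)    # scores each candidate block representation

    def forward(self, char_emb):                       # (batch, seq_len, dim)
        candidates = []
        for b in self.block_sizes:
            # Mean-pool non-overlapping blocks of size b, then broadcast each
            # block representation back to every position it covers.
            pooled = F.avg_pool1d(char_emb.transpose(1, 2), b, stride=b, ceil_mode=True)
            expanded = pooled.repeat_interleave(b, dim=2)[:, :, : char_emb.size(1)]
            candidates.append(expanded.transpose(1, 2))            # (batch, seq_len, dim)
        cand = torch.stack(candidates, dim=2)                      # (batch, seq_len, n_sizes, dim)
        weights = F.softmax(self.score(cand).squeeze(-1), dim=-1)  # per-position weights over block sizes
        return (weights.unsqueeze(-1) * cand).sum(dim=2)           # softly mixed latent "subwords"

chars = torch.randn(2, 16, 128)        # e.g., 16 character embeddings per sequence
print(SoftBlockMixer()(chars).shape)   # torch.Size([2, 16, 128])
```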
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers: 1) SHUOWEN (meaning Talk Word), pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.