From Tokens to Words: On the Inner Lexicon of LLMs
- URL: http://arxiv.org/abs/2410.05864v2
- Date: Thu, 10 Oct 2024 12:41:26 GMT
- Title: From Tokens to Words: On the Inner Lexicon of LLMs
- Authors: Guy Kaplan, Matanel Oren, Yuval Reif, Roy Schwartz
- Abstract summary: Natural language is composed of words, but modern LLMs process sub-words as input.
We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations.
Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language is composed of words, but modern LLMs process sub-words as input. A natural question raised by this discrepancy is whether LLMs encode words internally, and if so, how. We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent word representations. Our experiments show that this process takes place primarily within the early and middle layers of the model. They also show that it is robust to non-morphemic splits, typos and, perhaps most importantly, to out-of-vocabulary words: when feeding the inner representation of such words to the model as input vectors, it can "understand" them despite never seeing them during training. Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope. These insights provide a practical, finetuning-free application for expanding the vocabulary of pre-trained models. By enabling the addition of new vocabulary words, we reduce input length and inference iterations, which reduces both space and model latency, with little to no loss in model accuracy.
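The vocabulary-expansion idea can be sketched with a toy example. Everything below is illustrative: the greedy tokenizer, the toy vocabulary, and the mean-pooling merge are assumptions, while the paper instead reads the merged word representation off the model's early-to-middle hidden layers.

```python
# Toy sketch: adding a merged "word vector" to an expanded vocabulary
# shortens the tokenized input, at the cost of one extra embedding row.
from typing import Dict, List

def tokenize(text: str, vocab: Dict[str, List[float]]) -> List[str]:
    """Greedy longest-match sub-word tokenization over a toy vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens

def merged_word_vector(pieces: List[str], vocab: Dict[str, List[float]]) -> List[float]:
    """Stand-in 'detokenized' representation: mean of the sub-word vectors."""
    dim = len(next(iter(vocab.values())))
    return [sum(vocab[p][d] for p in pieces) / len(pieces) for d in range(dim)]

vocab = {"north": [1.0, 0.0], "eastern": [0.0, 1.0]}
pieces = tokenize("northeastern", vocab)      # two sub-word tokens
vocab["northeastern"] = merged_word_vector(pieces, vocab)
shortened = tokenize("northeastern", vocab)   # a single token after expansion
```

Greedy longest-match stands in here for a real BPE tokenizer; the point is only that one merged entry halves the token count for this word, which is where the claimed space and latency savings come from.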
Related papers
- Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs [20.1025293763531]
Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east".
In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers.
arXiv Detail & Related papers (2024-06-28T17:54:47Z) - Active Use of Latent Constituency Representation in both Humans and Large Language Models [9.995581737621505]
We show that a latent tree-structured constituency representation can emerge in both the human brain and large language models.
arXiv Detail & Related papers (2024-05-28T14:50:22Z) - Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as mask language modeling, where the higher layers obtain better lexical semantics.
arXiv Detail & Related papers (2024-03-03T13:14:47Z) - Word Embeddings Revisited: Do LLMs Offer Something New? [2.822851601000061]
Learning meaningful word embeddings is key to training a robust language model.
The recent rise of Large Language Models (LLMs) has provided us with many new word/sentence/document embedding models.
arXiv Detail & Related papers (2024-02-16T21:47:30Z) - Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z) - AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two heuristics to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - across different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and yields up to a 25% improvement in generation speed.
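A minimal sketch of the Unicode-based script-filtering idea: keep only vocabulary entries in the target script and remap their ids, so the embedding matrix can be gathered down to the kept rows. The ASCII test and the toy vocabulary are assumptions; the paper's heuristics are more involved.

```python
# Sketch of vocabulary trimming by script: drop tokens outside the
# target script and assign new contiguous ids to the survivors.
from typing import Callable, Dict

def trim_vocab(vocab: Dict[str, int], keep: Callable[[str], bool]) -> Dict[str, int]:
    """Return a remapped vocabulary containing only tokens that pass `keep`.
    Embedding rows would be gathered in the same id order."""
    kept = [tok for tok in sorted(vocab, key=vocab.get) if keep(tok)]
    return {tok: i for i, tok in enumerate(kept)}

full = {"the": 0, "кот": 1, "cat": 2, "猫": 3}
latin_only = trim_vocab(full, lambda t: t.isascii())  # keeps 'the' and 'cat'
```

The memory saving follows directly: the output embedding matrix shrinks in proportion to the fraction of tokens dropped, which is large for small models where embeddings dominate the parameter count.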
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
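The two-level idea can be sketched as follows. This is a toy, not the paper's architecture: a deterministic embedding and mean pooling stand in for the learned character-level encoder, and the function names are invented for illustration.

```python
# Toy two-level sketch: level 1 pools character embeddings into one
# vector per word; level 2 would consume the word vectors, so no
# fixed sub-word vocabulary is needed anywhere.
from typing import List

def char_embed(ch: str, dim: int = 4) -> List[float]:
    # deterministic toy embedding derived from the code point (hypothetical)
    return [float((ord(ch) >> s) & 1) for s in range(dim)]

def word_vector(word: str, dim: int = 4) -> List[float]:
    vecs = [char_embed(c, dim) for c in word]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def encode(sentence: str, dim: int = 4) -> List[List[float]]:
    """Level 1: one pooled vector per whitespace word."""
    return [word_vector(w, dim) for w in sentence.split()]

vectors = encode("open vocabulary model")  # one vector per word, any word
```

Because level 1 operates on raw characters, any string - including words never seen in training - gets a representation, which is the sense in which such a model is open-vocabulary.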
arXiv Detail & Related papers (2023-05-23T23:22:20Z) - Chain-of-Dictionary Prompting Elicits Translation in Large Language Models [100.47154959254937]
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT).
We present a novel method, CoD, which augments LLMs with prior knowledge via chains of multilingual dictionaries for a subset of input words to elicit translation abilities.
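A hypothetical sketch of how chained dictionary entries might be prepended to a translation request. The prompt template, dictionary format, and function name here are assumptions, not the paper's exact prompt.

```python
# Sketch: chain each source word through its translations in several
# languages, then append the actual translation instruction.
from typing import Dict, List

def cod_prompt(sentence: str, chains: Dict[str, List[str]]) -> str:
    """Prepend chained dictionary hints (word -> translations) to a request."""
    hints = [
        f'"{word}" means ' + " means ".join(f'"{t}"' for t in targets) + "."
        for word, targets in chains.items()
    ]
    return "\n".join(hints + [f"Translate into German: {sentence}"])

prompt = cod_prompt(
    "The cat sleeps.",
    {"cat": ["chat (French)", "Katze (German)"]},
)
```

The chain gives the model several anchor points per word across languages, which is plausibly why it helps for low-resource directions where a single bilingual hint would be sparse.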
arXiv Detail & Related papers (2023-05-11T05:19:47Z) - Extensible Prompts for Language Models on Zero-shot Language Style Customization [89.1622516945109]
X-Prompt instructs a large language model (LLM) beyond natural language (NL).
Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words.
These imaginary words are designed to be out-of-distribution robust so that they can be (re)used like NL words in various prompts.
arXiv Detail & Related papers (2022-12-01T16:11:56Z) - Breaking Character: Are Subwords Good Enough for MRLs After All? [36.11778282905458]
We pretrain a BERT-style language model over character sequences instead of word-pieces.
We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous MRLs.
Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks, subword-based PLMs achieve significantly higher performance on semantic tasks.
arXiv Detail & Related papers (2022-04-10T18:54:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.