Lexinvariant Language Models
- URL: http://arxiv.org/abs/2305.16349v1
- Date: Wed, 24 May 2023 19:10:46 GMT
- Title: Lexinvariant Language Models
- Authors: Qian Huang, Eric Zelikman, Sarah Li Chen, Yuhuai Wu, Gregory Valiant,
Percy Liang
- Abstract summary: Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
- Score: 84.2829117441298
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Token embeddings, a mapping from discrete lexical symbols to continuous
vectors, are at the heart of any language model (LM). However, lexical symbol
meanings can also be determined and even redefined by their structural role in
a long context. In this paper, we ask: is it possible for a language model to
be performant without \emph{any} fixed token embeddings? Such a language model
would have to rely entirely on the co-occurrence and repetition of tokens in the
context rather than the \textit{a priori} identity of any token. To answer
this, we study \textit{lexinvariant} language models that are invariant to
lexical symbols and therefore do not need fixed token embeddings in practice.
First, we prove that we can construct a lexinvariant LM to converge to the true
language model at a uniform rate that is polynomial in terms of the context
length, with a constant factor that is sublinear in the vocabulary size.
Second, to build a lexinvariant LM, we simply encode tokens using random
Gaussian vectors, such that each token maps to the same representation within
each sequence but different representations across sequences. Empirically, we
demonstrate that it can indeed attain perplexity comparable to that of a
standard language model, given a sufficiently long context. We further explore
two properties of the lexinvariant language models: First, given text generated
from a substitution cipher of English, it implicitly implements Bayesian
in-context deciphering and infers the mapping to the underlying real tokens
with high accuracy. Second, it has on average 4X better accuracy over synthetic
in-context reasoning tasks. Finally, we discuss regularizing standard language
models towards lexinvariance and potential practical applications.
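
The random Gaussian encoding described above is simple enough to sketch. Below is a minimal illustration (not the authors' released code) of the idea that, within one sequence, every occurrence of a token shares the same representation, while the mapping is re-sampled for each new sequence so there is no fixed embedding table; the function name, dimensions, and scaling are assumptions made for the sake of the example.

```python
import numpy as np

def lexinvariant_encode(token_ids, d_model=512, rng=None):
    """Encode a token sequence with per-sequence random Gaussian vectors.

    Each distinct token id receives one freshly sampled Gaussian vector,
    so repetitions within this sequence share a representation, but the
    mapping changes every time the function is called on a new sequence.
    """
    rng = np.random.default_rng() if rng is None else rng
    codes = {}   # per-sequence mapping: token id -> vector
    rows = []
    for tok in token_ids:
        if tok not in codes:
            # Sample a fresh Gaussian vector the first time a token appears.
            codes[tok] = rng.standard_normal(d_model) / np.sqrt(d_model)
        rows.append(codes[tok])
    return np.stack(rows)   # shape: (sequence_length, d_model)

# Repeated tokens inside one sequence map to identical rows ...
seq = [17, 42, 17, 99]
enc = lexinvariant_encode(seq)
assert np.allclose(enc[0], enc[2])
# ... but a second call re-randomizes the mapping across sequences.
assert not np.allclose(enc, lexinvariant_encode(seq))
```

Used in place of the usual embedding lookup, such per-sequence codes force the model to rely on co-occurrence and repetition in the context rather than any a priori token identity, which is exactly the invariance the paper studies.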
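To make the in-context deciphering experiment concrete, a substitution cipher applies a fixed random permutation of the symbol set to a text; a lexinvariant model, seeing only the ciphertext in its context, must recover the structure without knowing the permutation. The helper below is a hypothetical sketch for generating such ciphertext over lowercase letters, not the paper's evaluation code.

```python
import random
import string

def substitution_cipher(text, seed=0):
    """Encrypt text with a random one-to-one substitution over lowercase letters."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(letters, shuffled))
    # Characters outside the mapped alphabet (spaces, punctuation) pass through.
    return "".join(mapping.get(ch, ch) for ch in text.lower())

print(substitution_cipher("the quick brown fox jumps over the lazy dog"))
```

According to the abstract, given enough such ciphertext in context, the lexinvariant LM implicitly performs Bayesian inference over the substitution mapping and recovers the underlying tokens with high accuracy.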
Related papers
- Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence [6.991281327290525]
We propose a novel approach for learning interchangeable tokens in language models.
Our method is designed to address alpha-equivalence, the principle that renaming bound variables in a syntactic expression preserves semantics.
arXiv Detail & Related papers (2024-10-22T16:34:36Z) - Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a framework for learning a unified lexical representation for both modalities without complex design.
We use DINOv2 as our visual model for its local-inclined features and Llama 2, a generative language model, to leverage its in-context lexical prediction ability.
We demonstrate that these two pre-trained uni-modal models can be well-aligned by fine-tuning on a modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z) - Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z) - Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings [4.2243058640527575]
Cross-lingual transfer learning is an important property of multilingual large language models (LLMs).
Our research opens the door for investigations in 1) The effect of pre-training and model architectures on representations of languages and 2) The applications of cross-lingual representations embedded in language models.
arXiv Detail & Related papers (2023-11-29T19:20:14Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z) - Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word
Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z) - Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)