Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in
Practice
- URL: http://arxiv.org/abs/2203.06462v1
- Date: Sat, 12 Mar 2022 15:34:54 GMT
- Title: Low-Rank Softmax Can Have Unargmaxable Classes in Theory but Rarely in
Practice
- Authors: Andreas Grivas, Nikolay Bogoychev, Adam Lopez
- Abstract summary: We develop algorithms to detect unargmaxable tokens in public language models.
We find that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality.
- Score: 18.296971636710985
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Classifiers in natural language processing (NLP) often have a large number of
output classes. For example, neural language models (LMs) and machine
translation (MT) models both predict tokens from a vocabulary of thousands. The
Softmax output layer of these models typically receives as input a dense
feature representation, which has much lower dimensionality than the output. In
theory, the result is that some words may be impossible to predict via argmax,
irrespective of input features, and empirically, there is evidence this happens
in small language models. In this paper we ask whether it can happen in
practical large language models and translation models. To do so, we develop
algorithms to detect such unargmaxable tokens in public models. We find
that 13 out of 150 models do indeed have such tokens; however, they are very
infrequent and unlikely to impact model quality. We release our algorithms and
code to the public.
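The effect described in the abstract can be illustrated with a toy example. The sketch below is not the detection algorithm the paper develops; it is a minimal sampling demonstration under stated assumptions: a bias-free output layer, hypothetical 2-D weight vectors (`WEIGHTS`), and hypothetical helper names (`argmax_class`, `count_wins`). The fourth weight vector lies strictly inside the triangle formed by the other three, so its logit is a convex combination of theirs and can never be the strict maximum, whatever the input features are.

```python
import random

# Hypothetical 2-D output embeddings for 4 classes (assumption: no bias term).
WEIGHTS = [
    (1.0, 0.0),
    (-0.5, 0.866),
    (-0.5, -0.866),
    (0.1, 0.1),  # strictly inside the triangle spanned by the three rows above
]


def argmax_class(x):
    """Return the index of the class with the largest logit for input x."""
    logits = [w[0] * x[0] + w[1] * x[1] for w in WEIGHTS]
    return max(range(len(logits)), key=logits.__getitem__)


def count_wins(num_samples=10_000, seed=0):
    """Sample random feature vectors and count how often each class is the argmax."""
    rng = random.Random(seed)
    wins = [0] * len(WEIGHTS)
    for _ in range(num_samples):
        x = (rng.gauss(0.0, 10.0), rng.gauss(0.0, 10.0))
        wins[argmax_class(x)] += 1
    return wins


if __name__ == "__main__":
    # The last count stays at zero: class 3 is never selected by argmax.
    print(count_wins())
```

Sampling can only suggest that a class is unargmaxable; it cannot prove it. An exact check would need to show that no input exists for which the class wins, which is the kind of question the paper's algorithms are designed to answer for real models.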
Related papers
- Distribution-Aware Companding Quantization of Large Language Models [0.0]
Large language models such as GPT and Llama are trained with a next-token prediction loss.
We suggest that training language models to predict multiple future tokens at once results in higher sample efficiency.
arXiv Detail & Related papers (2026-02-27T23:00:54Z) - Provably Learning from Modern Language Models via Low Logit Rank [22.148282143726835]
Low logit rank models can encode hard-to-learn distributions such as noisy parities.
We show how this structure can be exploited algorithmically for obtaining provable learning guarantees.
Our result gives what we believe is the first end-to-end learning guarantee for a generative model that plausibly captures modern language models.
arXiv Detail & Related papers (2025-12-10T18:18:11Z) - Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept.
Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow.
We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z) - From Language Models over Tokens to Language Models over Characters [54.123846188068384]
Modern language models are internally -- and mathematically -- distributions over token strings rather than character strings.
This paper presents algorithms for converting token-level language models to character-level ones.
arXiv Detail & Related papers (2024-12-04T21:19:20Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z) - Better & Faster Large Language Models via Multi-token Prediction [29.067271500844928]
Large language models such as GPT and Llama are trained with a next-token prediction loss.
We suggest that training language models to predict multiple future tokens at once results in higher sample efficiency.
arXiv Detail & Related papers (2024-04-30T17:33:57Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z) - Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers [0.6091702876917281]
We focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models.
We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes.
arXiv Detail & Related papers (2022-12-16T20:01:22Z) - Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z) - Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z) - Limitations of Autoregressive Models and Their Alternatives [31.827580420643606]
These limitations apply no matter how much computation and data are used to train the model.
Energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string) are powerful enough to escape these limitations.
arXiv Detail & Related papers (2020-10-22T17:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.