Correlation and Navigation in the Vocabulary Key Representation Space of Language Models
- URL: http://arxiv.org/abs/2410.02284v1
- Date: Thu, 3 Oct 2024 08:07:55 GMT
- Title: Correlation and Navigation in the Vocabulary Key Representation Space of Language Models
- Authors: Letian Peng, Chenyang An, Jingbo Shang
- Abstract summary: We study the effect of the key distribution on the NTP distribution.
We show that in the NTP distribution, the few top-ranked tokens are typically accurate.
We extend our method to open-ended and chain-of-thought (for reasoning) generation.
- Score: 33.747872934103334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language model (LM) decoding is based on the next-token prediction (NTP) probability distribution. For neural LMs (e.g., Transformer-based), the NTP distribution is essentially a softmax-normalized dot product between an encoded input context (query) and fixed vocabulary representations (keys). In this paper, we study the effect of the key distribution on the NTP distribution, focusing on whether similarity between keys triggers spurious correlations in NTP. Through knowledge-probing tasks, we show that in the NTP distribution, the few top-ranked tokens are typically accurate. However, the middle-ranked predictions are heavily biased towards tokens that are distributionally (not necessarily semantically) similar to the top ones. For instance, if "P" is predicted as the top-1 token, "A"-"Z" will all be ranked high in NTP, regardless of whether they lead to correct decoding results. This hurts sampling diversity and makes the sampling of correct, long-tail results hopeless and noisy. We attempt to alleviate this issue via a novel in-context navigation (ICN) method that iteratively pushes the query representation away from explored regions. Specifically, we include the explored decoding results in the context and prompt the LM to generate something else, which encourages the LM to produce a query representation that has small dot products with explored keys. Experiments on knowledge-probing tasks show that our method navigates efficiently away from explored keys to correct new keys. We further extend our method to open-ended and chain-of-thought (for reasoning) generation. Experimental results show that ICN contributes to better generation diversity and improved self-consistency voting performance. Finally, we discuss potential training issues caused by the fixed key space, together with the challenges and possible ways to address them in future research.
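Both objects in the abstract — the NTP distribution as a query-key dot product, and the in-context navigation loop — are compact enough to sketch. Below is a minimal illustration with a toy random vocabulary (not the authors' released code); `lm_generate` and the prompt wording in part (b) are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- (a) NTP as a softmax over query-key dot products ----------------
d, vocab = 64, 1000
keys = rng.normal(size=(vocab, d))               # fixed vocabulary keys
keys[1] = keys[0] + 0.05 * rng.normal(size=d)    # token 1: near-duplicate of token 0's key

def ntp_distribution(query):
    logits = keys @ query                        # dot product with every key
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

query = keys[0] + 0.1 * rng.normal(size=d)       # a query aimed at token 0
top5 = np.argsort(ntp_distribution(query))[::-1][:5]
print(top5)  # token 1 typically rides along near the top, purely via key similarity

# --- (b) The ICN loop, as pseudocode over an arbitrary LM call --------
def icn_sample(lm_generate, question, rounds=3):
    explored = []
    for _ in range(rounds):
        context = question
        if explored:
            context += ("\nAlready explored: " + ", ".join(explored)
                        + "\nGenerate something else:")
        explored.append(lm_generate(context))    # new query is pushed away from explored keys
    return explored
```

In part (a), token 1 ranks high only because its key sits near token 0's key — exactly the spurious correlation the paper describes; part (b) shows how appending explored answers to the context steers the next query representation elsewhere.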
Related papers
- NDP: Next Distribution Prediction as a More Broad Target [59.30497395313209]
We introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets.
NDP can achieve up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and incredible +10.75 average improvement in the medical domain.
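As a rough sketch of that target construction (bigram counts standing in for general $n$-grams; the names are illustrative, not from the paper):

```python
from collections import Counter, defaultdict
import numpy as np

def bigram_targets(token_ids, vocab_size):
    """Empirical next-token distribution for each preceding token (n=2)."""
    counts = defaultdict(Counter)
    for a, b in zip(token_ids, token_ids[1:]):
        counts[a][b] += 1
    dists = {}
    for a, ctr in counts.items():
        dist = np.zeros(vocab_size)
        total = sum(ctr.values())
        for b, c in ctr.items():
            dist[b] = c / total                  # soft target instead of one-hot
        dists[a] = dist
    return dists

def soft_cross_entropy(log_probs, target_dist):
    """One-hot CE becomes an expectation over the n-gram distribution."""
    return -(target_dist * log_probs).sum()
```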
arXiv Detail & Related papers (2024-08-30T16:13:49Z)
- Assessing Keyness using Permutation Tests [0.0]
We replace the token-by-token sampling model with a model in which corpora are samples of documents rather than tokens.
We do not need any assumption on how the tokens are organized within or across documents, and the approach works with basically *any* keyness score.
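A document-level permutation test is straightforward to sketch; the keyness score below is a stand-in log-frequency ratio, chosen only because the approach is score-agnostic:

```python
import math, random

def keyness(word, docs_a, docs_b):
    """Stand-in score: log-ratio of relative frequencies (any score works)."""
    fa = sum(d.count(word) for d in docs_a) / max(1, sum(len(d) for d in docs_a))
    fb = sum(d.count(word) for d in docs_b) / max(1, sum(len(d) for d in docs_b))
    return math.log((fa + 1e-9) / (fb + 1e-9))

def permutation_pvalue(word, docs_a, docs_b, n_perm=10_000, seed=0):
    rng = random.Random(seed)
    observed = keyness(word, docs_a, docs_b)
    pool, n_a = list(docs_a) + list(docs_b), len(docs_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)                        # permute documents, not tokens
        if keyness(word, pool[:n_a], pool[n_a:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```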
arXiv Detail & Related papers (2023-08-25T13:52:57Z)
- Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
$k$-Nearest neighbor machine translation ($k$NN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains.
We propose a novel approach to boost the datastore retrieval of $k$NN-MT by reconstructing the original datastore.
Our method can effectively boost the datastore retrieval and translation quality of $k$NN-MT.
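For orientation, the datastore's role in vanilla $k$NN-MT can be sketched as follows (a generic interpolation, not the paper's datastore-reconstruction method; parameter names are illustrative):

```python
import numpy as np

def knn_mt_probs(query, datastore_keys, datastore_targets,
                 model_probs, vocab_size, k=8, temp=10.0, lam=0.5):
    """Generic kNN-MT step: retrieve k nearest stored contexts and mix
    their target-token votes with the base model's distribution."""
    d2 = ((datastore_keys - query) ** 2).sum(axis=1)   # L2 distance to every key
    idx = np.argsort(d2)[:k]                           # k nearest contexts
    w = np.exp(-d2[idx] / temp)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    for i, weight in zip(idx, w):
        p_knn[datastore_targets[i]] += weight          # vote for stored targets
    return lam * p_knn + (1 - lam) * model_probs
```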
arXiv Detail & Related papers (2023-05-26T03:04:42Z)
- KNN-LM Does Not Improve Open-ended Text Generation [34.86733697757264]
We study the generation quality of retrieval-augmented language models (LMs).
We find that interpolating with a retrieval distribution actually increases perplexity compared to a baseline Transformer LM.
We discover that the entropy of the retrieval distribution increases faster than that of the base LM as the generated sequence becomes longer.
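That diagnostic reduces to comparing two entropy curves over decoding steps; a minimal sketch, assuming both per-step distributions are already available:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector, in nats."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def entropy_gap(per_step_retrieval_probs, per_step_lm_probs):
    """Per-step gap; a widening gap means the retrieval distribution
    flattens faster than the base LM's as the sequence grows."""
    return [entropy(pr) - entropy(pl)
            for pr, pl in zip(per_step_retrieval_probs, per_step_lm_probs)]
```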
arXiv Detail & Related papers (2023-05-24T01:48:33Z)
- M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
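The open-word extension can be sketched with NLTK's WordNet interface (the simple selection heuristic below is our assumption; the paper's filtering will differ):

```python
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def extend_label_words(label_words, synsets_per_word=5):
    """Pad a closed label set with WordNet lemmas to simulate an open set."""
    open_words = set(label_words)
    for w in label_words:
        for syn in wn.synsets(w)[:synsets_per_word]:
            for lemma in syn.lemma_names():
                open_words.add(lemma.replace("_", " "))
    return sorted(open_words)

# e.g. extend_label_words(["cat", "dog"]) mixes related lemmas into the prompt vocabulary
```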
arXiv Detail & Related papers (2023-03-09T09:05:47Z)
- NFL: Robust Learned Index via Distribution Transformation [14.812854942243503]
This paper tackles the approximation problem by applying a distribution transformation to the keys before constructing the learned index.
A two-stage Normalizing-Flow-based Learned index framework (NFL) is proposed, which first transforms the original complex key distribution into a near-uniform distribution, then builds a learned index leveraging the transformed keys.
Based on the characteristics of the transformed keys, we propose a robust After-Flow Learned Index (AFLI).
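The first stage can be illustrated with an empirical-CDF transform standing in for the normalizing flow — both are monotone maps toward a near-uniform key distribution:

```python
import numpy as np

def fit_cdf_transform(train_keys):
    """Monotone map to ~Uniform(0,1) via the empirical CDF (flow stand-in)."""
    sorted_keys = np.sort(train_keys)
    def transform(keys):
        ranks = np.searchsorted(sorted_keys, keys, side="right")
        return ranks / len(sorted_keys)
    return transform

keys = np.random.default_rng(0).lognormal(size=100_000)  # skewed key distribution
to_uniform = fit_cdf_transform(keys)
u = to_uniform(keys)   # near-uniform: a simple linear model now indexes it well
```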
arXiv Detail & Related papers (2022-05-24T06:03:19Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
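A sketch of a discounted-recall computation: the IoU test is standard, but the specific boundary-error discount below is an illustrative assumption, not necessarily the paper's exact formula:

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def discounted_recall(preds, gts, durations, m=0.5):
    """dR@1,IoU@m-style score; the boundary-error discount is illustrative."""
    scores = []
    for pred, gt, dur in zip(preds, gts, durations):
        hit = temporal_iou(pred, gt) >= m
        a = max(0.0, 1 - abs(pred[0] - gt[0]) / dur)   # start-boundary discount
        b = max(0.0, 1 - abs(pred[1] - gt[1]) / dur)   # end-boundary discount
        scores.append(hit * a * b)
    return sum(scores) / len(scores)
```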
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs).
PFNs leverage in-context learning and large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
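The PFN recipe — sample supervised tasks from a prior, then fit the posterior predictive in context — can be sketched as a data-generation loop (toy linear prior; the transformer itself is omitted):

```python
import numpy as np

def sample_task_from_prior(rng, n_points=20, dim=2, noise=0.1):
    """Toy prior: linear functions with Gaussian weights and noise."""
    w = rng.normal(size=dim)
    X = rng.normal(size=(n_points, dim))
    y = X @ w + noise * rng.normal(size=n_points)
    return X, y

def pfn_training_batches(rng, n_tasks=1000, n_context=15):
    """Yield (context, query, target) triples; training the network to
    predict held-out targets given the context approximates the
    posterior predictive under the prior."""
    for _ in range(n_tasks):
        X, y = sample_task_from_prior(rng)
        yield (X[:n_context], y[:n_context]), X[n_context:], y[n_context:]
```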
arXiv Detail & Related papers (2021-12-20T13:07:39Z)
- Bridging Few-Shot Learning and Adaptation: New Challenges of Support-Query Shift [4.374837991804085]
Few-Shot Learning algorithms have made substantial progress in learning novel concepts with just a handful of labelled data.
To classify query instances from novel classes encountered at test-time, they only require a support set composed of a few labelled samples.
In a realistic setting, the data distribution is plausibly subject to change, a situation referred to as Distribution Shift (DS).
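The support/query mechanics, and where a shift hurts, can be sketched with a prototypical-network classifier (a common few-shot baseline, used here purely for illustration):

```python
import numpy as np

def classify_queries(support_feats, support_labels, query_feats):
    """Assign each query to its nearest class prototype (NumPy arrays).
    Accuracy degrades when support and query features come from shifted
    distributions -- the support-query shift setting."""
    classes = np.unique(support_labels)
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])
    d2 = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[d2.argmin(axis=1)]
```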
arXiv Detail & Related papers (2021-05-25T10:10:09Z)
- FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding [55.38905499274026]
Few-shot learning is one of the key future steps in machine learning.
FewJoint is a novel Few-Shot Learning benchmark for NLP.
arXiv Detail & Related papers (2020-09-17T08:17:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.