Text vectorization via transformer-based language models and n-gram perplexities
- URL: http://arxiv.org/abs/2307.09255v1
- Date: Tue, 18 Jul 2023 13:38:39 GMT
- Title: Text vectorization via transformer-based language models and n-gram perplexities
- Authors: Mihailo Škorić
- Abstract summary: Given that perplexity is a scalar value that refers to the entire input, information about the probability distribution within it is lost in the calculation.
This research proposes a simple algorithm that calculates vector values based on n-gram perplexities within the input.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As the probability (and thus perplexity) of a text is calculated from the product of the probabilities of its individual tokens, a single unlikely token, potentially no more than a simple typographical error, can significantly reduce the probability (i.e., increase the perplexity) of an otherwise highly probable input. Moreover, since perplexity is a scalar value that refers to the entire input, information about the probability distribution within the input is lost in the calculation: a relatively good text containing one unlikely token and another text in which every token is equally likely can end up with the same perplexity value, especially for longer texts. As an alternative to scalar perplexity, this research proposes a simple algorithm that calculates vector values based on n-gram perplexities within the input. Such representations take the aspects mentioned above into account: instead of a single value, the relative perplexity of each text token is calculated, and these values are combined into one vector representing the input.
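The abstract describes the method only at a high level, so the following is a minimal Python sketch of the general idea rather than the author's implementation: it assumes a Hugging Face causal language model (the gpt2 checkpoint and the n-gram size N are placeholders) and computes one perplexity value per token, each conditioned only on the preceding n-1 tokens, stacking these values into a vector instead of collapsing the whole text into a single scalar.

```python
# A minimal sketch, assuming a Hugging Face causal LM; this illustrates
# the general idea, not the paper's exact algorithm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint (assumption)
N = 4                # assumed n-gram size

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def ngram_perplexity_vector(text: str, n: int = N) -> torch.Tensor:
    """One perplexity value per token, each conditioned on at most the
    n-1 preceding tokens, instead of a single scalar for the whole text."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    values = []
    with torch.no_grad():
        for i in range(1, len(ids)):
            context = ids[max(0, i - (n - 1)):i].unsqueeze(0)
            logits = model(context).logits[0, -1]           # next-token logits
            log_p = torch.log_softmax(logits, dim=-1)[ids[i]]
            values.append(torch.exp(-log_p))                # per-token perplexity
    # How the paper combines or normalizes these "relative" values is not
    # specified in the abstract; here they are simply stacked.
    return torch.stack(values)

print(ngram_perplexity_vector("This is a simple example sentence."))
```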
Related papers
- Unsupervised Representation Learning from Sparse Transformation Analysis [79.94858534887801]
We propose to learn representations from sequence data by factorizing the transformations of the latent variables into sparse components.
Input data are first encoded as distributions of latent activations and subsequently transformed using a probability flow model.
arXiv Detail & Related papers (2024-10-07T23:53:25Z)
- Where is the signal in tokenization space? [31.016041295876864]
Large Language Models (LLMs) are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences.
In this paper, we study non-canonical tokenizations.
arXiv Detail & Related papers (2024-08-16T05:56:10Z)
- Estimation of embedding vectors in high dimensions [10.55292041492388]
We consider a simple probability model for discrete data where there is some "true" but unknown embedding.
Under this model, it is shown that the embeddings can be learned by a variant of low-rank approximate message passing (AMP) method.
Our theoretical findings are validated by simulations on both synthetic data and real text data.
arXiv Detail & Related papers (2023-12-12T23:41:59Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
- Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z)
- Should you marginalize over possible tokenizations? [13.07994518230055]
We show that the gap in log-likelihood is no larger than 0.5% in most cases.
arXiv Detail & Related papers (2023-06-30T16:09:01Z)
- How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm? [73.80001705134147]
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm.
The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset.
arXiv Detail & Related papers (2022-10-15T04:11:56Z)
- Measuring the Interpretability of Unsupervised Representations via Quantized Reverse Probing [97.70862116338554]
We investigate the problem of measuring interpretability of self-supervised representations.
We formulate the latter as estimating the mutual information between the representation and a space of manually labelled concepts.
We use our method to evaluate a large number of self-supervised representations, ranking them by interpretability.
arXiv Detail & Related papers (2022-09-07T16:18:50Z)
- Large-Margin Representation Learning for Texture Classification [67.94823375350433]
This paper presents a novel approach combining convolutional layers (CLs) and large-margin metric learning for training supervised models on small datasets for texture classification.
The experimental results on texture and histopathologic image datasets have shown that the proposed approach achieves competitive accuracy with lower computational cost and faster convergence when compared to equivalent CNNs.
arXiv Detail & Related papers (2022-06-17T04:07:45Z)
- Gradient Origin Networks [8.952627620898074]
This paper proposes a new type of generative model that is able to quickly learn a latent representation without an encoder.
Experiments show that the proposed method converges faster, with significantly lower reconstruction error than autoencoders, while requiring half the parameters.
arXiv Detail & Related papers (2020-07-06T15:00:11Z)
- Probabilistic embeddings for speaker diarization [13.276960253126656]
Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization.
We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix.
These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments.
arXiv Detail & Related papers (2020-04-06T14:51:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.