Assessing Keyness using Permutation Tests
- URL: http://arxiv.org/abs/2308.13383v1
- Date: Fri, 25 Aug 2023 13:52:57 GMT
- Title: Assessing Keyness using Permutation Tests
- Authors: Thoralf Mildenberger
- Abstract summary: We replace the token-by-token sampling model by a model where corpora are samples of documents rather than tokens.
We do not need any assumption on how the tokens are organized within or across documents, and the approach works with basically *any* keyness score.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a resampling-based approach for assessing keyness in corpus
linguistics based on suggestions by Gries (2006, 2022). Traditional approaches
based on hypothesis tests (e.g. the Likelihood Ratio test) model the corpora as
independent identically distributed samples of tokens. This model does not
account for the often observed uneven distribution of occurrences of a word
across a corpus. When occurrences of a word are concentrated in a few documents,
large values of LLR and similar scores are in fact much more likely than
accounted for by the token-by-token sampling model, leading to false positives.
We replace the token-by-token sampling model by a model where corpora are
samples of documents rather than tokens, which is much closer to the way
corpora are actually assembled. We then use a permutation approach to
approximate the distribution of a given keyness score under the null hypothesis
of equal frequencies and obtain p-values for assessing significance. We do not
need any assumption on how the tokens are organized within or across documents,
and the approach works with basically *any* keyness score. Hence, apart from
obtaining more accurate p-values for scores like LLR, we can also assess
significance for scores such as the logratio, which has been proposed as a
measure of effect size.
An efficient implementation of the proposed approach is provided in the `R`
package `keyperm`, available from GitHub.
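To make the idea concrete, below is a bare-bones sketch of a document-level permutation test in `R`. It is only an illustration under simplifying assumptions, not the `keyperm` API: all function and variable names are hypothetical, and it handles a single word and two corpora.

```r
## Illustrative sketch (NOT the keyperm API): document-level permutation
## test for the log-likelihood ratio (LLR, Dunning's G^2) of one word.

llr <- function(k1, n1, k2, n2) {
  ## G^2 for the 2x2 table (word vs. other tokens) x (corpus A vs. corpus B)
  N <- n1 + n2                       # total tokens
  K <- k1 + k2                       # total occurrences of the word
  O <- c(k1, k2, n1 - k1, n2 - k2)   # observed cell counts
  E <- c(n1 * K, n2 * K, n1 * (N - K), n2 * (N - K)) / N  # expected counts
  2 * sum(ifelse(O > 0, O * log(O / E), 0))  # 0 * log(0) taken as 0
}

perm_pvalue <- function(counts, sizes, in_a, B = 9999) {
  ## counts: per-document occurrences of the word
  ## sizes:  per-document token counts
  ## in_a:   logical, TRUE if the document belongs to corpus A
  stat <- function(grp) llr(sum(counts[grp]),  sum(sizes[grp]),
                            sum(counts[!grp]), sum(sizes[!grp]))
  observed <- stat(in_a)
  null <- replicate(B, stat(sample(in_a)))   # permute whole-document labels
  (1 + sum(null >= observed)) / (B + 1)      # upper-tail permutation p-value
}

## Toy example: the word is concentrated in two documents of corpus A
set.seed(1)
counts <- c(30, 25, 0, 1, 0, 0,  2, 0, 1, 0, 0, 0)
sizes  <- rep(1000, 12)
in_a   <- rep(c(TRUE, FALSE), each = 6)
perm_pvalue(counts, sizes, in_a)
```

Because whole documents are shuffled rather than individual tokens, a word whose occurrences are concentrated in a few documents produces a wide null distribution, so the same LLR value comes out less significant than under token-by-token sampling.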
Related papers
- Correlation and Navigation in the Vocabulary Key Representation Space of Language Models [33.747872934103334]
We study the effect of the key distribution on the NTP distribution.
We show that in the NTP distribution, the few top-ranked tokens are typically accurate.
We extend our method to open-ended and chain-of-thought (for reasoning) generation.
arXiv Detail & Related papers (2024-10-03T08:07:55Z) - Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z) - Prototype-based Aleatoric Uncertainty Quantification for Cross-modal
Retrieval [139.21955930418815]
Cross-modal Retrieval methods build similarity relations between vision and language modalities by jointly learning a common representation space.
However, the predictions are often unreliable due to aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts.
We propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
arXiv Detail & Related papers (2023-09-29T09:41:19Z) - You should evaluate your language model on marginal likelihood
overtokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z) - Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement
Learning [30.09715149060206]
Keyphrase Generation (KG) is a classical task for capturing the central idea from a given document.
In this paper, we propose a new fine-grained evaluation metric that considers different granularity.
For learning more recessive linguistic patterns, we use a pre-trained model (e.g., BERT) to compute the continuous similarity score between predicted keyphrases and target keyphrases.
arXiv Detail & Related papers (2021-04-18T10:13:46Z) - An Empirical Comparison of Instance Attribution Methods for NLP [62.63504976810927]
We evaluate the degree to which different instance attribution methods agree with respect to the importance of training samples.
We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods.
arXiv Detail & Related papers (2021-04-09T01:03:17Z) - MASKER: Masked Keyword Regularization for Reliable Text Classification [73.90326322794803]
We propose a fine-tuning method, coined masked keyword regularization (MASKER), that facilitates context-based prediction.
MASKER regularizes the model to reconstruct the keywords from the rest of the words and to make low-confidence predictions when there is not enough context.
We demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy.
arXiv Detail & Related papers (2020-12-17T04:54:16Z) - Probabilistic Anchor Assignment with IoU Prediction for Object Detection [9.703212439661097]
In object detection, determining which anchors to assign as positive or negative samples, known as anchor assignment, has been revealed as a core procedure that can significantly affect a model's performance.
We propose a novel anchor assignment strategy that adaptively separates anchors into positive and negative samples for a ground truth bounding box according to the model's learning status.
arXiv Detail & Related papers (2020-07-16T04:26:57Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.