Quantifying Reproducibility in NLP and ML
- URL: http://arxiv.org/abs/2109.01211v1
- Date: Thu, 2 Sep 2021 21:00:17 GMT
- Title: Quantifying Reproducibility in NLP and ML
- Authors: Anya Belz
- Abstract summary: Reproducibility has become an intensely debated topic in NLP and ML over recent years.
No commonly accepted way of assessing it, let alone quantifying it, has so far emerged.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reproducibility has become an intensely debated topic in NLP and ML over
recent years, but no commonly accepted way of assessing reproducibility, let
alone quantifying it, has so far emerged. The assumption has been that wider
scientific reproducibility terminology and definitions are not applicable to
NLP/ML, with the result that many different terms and definitions have been
proposed, some diametrically opposed. In this paper, we test this assumption,
by taking the standard terminology and definitions from metrology and applying
them directly to NLP/ML. We find that we are able to straightforwardly derive a
practical framework for assessing reproducibility which has the desirable
property of yielding a quantified degree of reproducibility that is comparable
across different reproduction studies.
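The abstract does not name the specific measure behind the quantified degree of reproducibility. In metrology, closeness of repeated measurements is captured by precision, often expressed as a coefficient of variation over the original and reproduced results; the sketch below is a minimal Python illustration under that assumption. The function name, the small-sample correction factor, and the example numbers are illustrative and not taken from the paper.

```python
from statistics import mean, stdev

def degree_of_reproducibility(measurements, small_sample_correction=True):
    """Coefficient of variation (as a percentage) over repeated measurements.

    `measurements` holds the same evaluation measure (e.g. BLEU) as obtained
    by an original study and one or more reproductions. Smaller values mean
    the results were more closely reproduced. Illustrative sketch only.
    """
    n = len(measurements)
    if n < 2:
        raise ValueError("need the original result plus at least one reproduction")
    m = mean(measurements)
    s = stdev(measurements)            # sample standard deviation (n - 1 denominator)
    cv = 100.0 * s / abs(m)            # relative to the mean, hence scale-independent
    if small_sample_correction:
        cv *= 1.0 + 1.0 / (4.0 * n)    # standard small-sample correction for CV
    return cv

# Hypothetical example: an original BLEU score plus three reproduction attempts.
print(degree_of_reproducibility([27.3, 26.8, 27.9, 25.4]))  # ~ 4.2
```

Because the score is expressed relative to the mean of the measurements, it does not depend on the scale of the underlying evaluation measure, which is what makes degrees of reproducibility comparable across different reproduction studies.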
Related papers
- Fairness Definitions in Language Models Explained [2.443957114877221]
Language Models (LMs) have demonstrated exceptional performance across various Natural Language Processing (NLP) tasks.
Despite these advancements, LMs can inherit and amplify societal biases related to sensitive attributes such as gender and race.
This paper proposes a systematic survey that clarifies the definitions of fairness as they apply to LMs.
arXiv Detail & Related papers (2024-07-26T01:21:25Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities [79.9629927171974]
Uncertainty in Large Language Models (LLMs) is crucial for applications where safety and reliability are important.
We propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs.
arXiv Detail & Related papers (2024-05-30T12:42:05Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
- Language models are not naysayers: An analysis of language models on negation benchmarks [58.32362243122714]
We evaluate the ability of current-generation auto-regressive language models to handle negation.
We show that LLMs have several limitations including insensitivity to the presence of negation, an inability to capture the lexical semantics of negation, and a failure to reason under negation.
arXiv Detail & Related papers (2023-06-14T01:16:37Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of the references can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Quantified Reproducibility Assessment of NLP Results [5.181381829976355]
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) based on concepts and definitions from metrology.
We test QRA on 18 system and evaluation measure combinations, for each of which we have the original results and one to seven reproduction results.
The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies (a short illustration follows this list).
arXiv Detail & Related papers (2022-04-12T17:22:46Z)
- A Systematic Review of Reproducibility Research in Natural Language Processing [3.0039296468567236]
The past few years have seen an impressive range of new initiatives, events and active research in the area.
The field is far from reaching a consensus about how reproducibility should be defined, measured and addressed.
arXiv Detail & Related papers (2021-03-14T13:53:05Z)
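To illustrate the comparability claim in the Quantified Reproducibility Assessment entry above, here is a short follow-on sketch using the same coefficient-of-variation-style score as the earlier sketch and hypothetical numbers: the raw standard deviation is tied to the scale of each study's evaluation measure, whereas the relative score can be compared directly across studies.

```python
from statistics import mean, stdev

def cv_star(measurements):
    """Coefficient of variation with small-sample correction (see the earlier sketch)."""
    n = len(measurements)
    return 100.0 * (stdev(measurements) / abs(mean(measurements))) * (1.0 + 1.0 / (4.0 * n))

# Hypothetical reproduction results from two unrelated studies:
bleu_scores = [27.3, 26.8, 27.9, 25.4]      # MT study, BLEU on a 0-100 scale
accuracies  = [0.912, 0.907, 0.915, 0.896]  # classification study, accuracy on a 0-1 scale

for name, values in [("BLEU study", bleu_scores), ("accuracy study", accuracies)]:
    # The raw standard deviation depends on the measure's scale; the relative score does not.
    print(f"{name}: stdev = {stdev(values):.4f}, degree of reproducibility = {cv_star(values):.2f}")
```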