When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
- URL: http://arxiv.org/abs/2303.16166v5
- Date: Thu, 4 Jul 2024 09:16:22 GMT
- Title: When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
- Authors: Sara Papi, Marco Gaido, Andrea Pilzer, Matteo Negri
- Abstract summary: We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture.
We propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models.
- Score: 23.30735117217225
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite its crucial role in research experiments, code correctness is often presumed only on the basis of the perceived quality of results. This assumption comes with the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on reproducibility should go hand in hand with the emphasis on software quality. We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results, which however can lead to incorrect conclusions that potentially misguide future research. As a countermeasure, we propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models, with the goal of promoting coding best practices and improving research software quality within the NLP community.
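As an illustration of the kind of behavioural test that such a checklist and testing library encourage, below is a minimal, self-contained sketch of a padding-invariance check for a lengths-aware encoder. The `TinyEncoder` module and its masking logic are illustrative stand-ins, not the pangoliNN API or the audited Conformer implementations; the point is only that appending padding frames should never change the output for the real frames.

```python
# Minimal sketch of a padding-invariance test for a lengths-aware encoder.
# TinyEncoder and its masking are illustrative stand-ins, not pangoliNN code.
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Toy encoder that masks padded frames before and after a depthwise conv."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim). Zero padded frames so they cannot leak into
        # real frames through the convolution's receptive field.
        mask = (torch.arange(x.size(1))[None, :] < lengths[:, None]).to(x.dtype)
        x = x * mask.unsqueeze(-1)
        out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return out * mask.unsqueeze(-1)


def test_padding_invariance():
    torch.manual_seed(0)
    model = TinyEncoder().eval()
    x, lengths = torch.randn(1, 10, 8), torch.tensor([10])
    # Same utterance, but with 5 extra padding frames appended (as happens
    # when batching sequences of different lengths).
    x_padded = torch.cat([x, torch.zeros(1, 5, 8)], dim=1)
    with torch.no_grad():
        out_ref = model(x, lengths)
        out_padded = model(x_padded, lengths)[:, :10]
    assert torch.allclose(out_ref, out_padded, atol=1e-6), \
        "encoder output changes with the amount of padding"


test_padding_invariance()
```

A test of this kind fails on implementations where padded positions leak into real positions through convolution or normalization layers, regardless of how good the downstream WER or BLEU scores look.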
Related papers
- Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course [1.553083901660282]
Testing plays an important role in securing the success of a software development project.
We investigate whether we can quantify the effects various types of testing have on functional suitability.
arXiv Detail & Related papers (2024-08-22T04:23:51Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% on our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
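A hedged, provider-agnostic sketch of the core idea follows: an LLM is prompted to link the error message of a failing test to the code change most likely responsible. The prompt format and the generic `llm` callable are assumptions for illustration, not the prompts or models used at EA.

```python
# Illustrative sketch: ask an LLM which candidate code change most plausibly
# caused a test failure, given the error message. Provider-agnostic.
from typing import Callable, Sequence


def rank_suspect_changes(error_message: str,
                         diffs: Sequence[str],
                         llm: Callable[[str], str]) -> str:
    """Return the model's verdict on which numbered diff caused the failure."""
    numbered = "\n\n".join(f"[{i}] {diff}" for i, diff in enumerate(diffs))
    prompt = (
        "A test failed with the following error message:\n"
        f"{error_message}\n\n"
        "Candidate code changes since the last passing run:\n"
        f"{numbered}\n\n"
        "Reply with the index of the change most likely responsible, "
        "followed by a one-sentence justification."
    )
    return llm(prompt)
```

Any chat or completion client can be wrapped as the `llm` callable; the returned index can then be parsed and surfaced to the developer.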
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
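A minimal sketch of what step-level evaluation can look like is given below, assuming a per-step scorer that outputs validity and redundancy probabilities. The `StepScore` structure and the min/max aggregation are illustrative assumptions, not ReasonEval's released models or its exact aggregation scheme.

```python
# Sketch of step-level reasoning evaluation: each step gets a validity and a
# redundancy score, and solution-level scores aggregate over steps.
from dataclasses import dataclass
from typing import List


@dataclass
class StepScore:
    validity: float    # P(step is logically correct), in [0, 1]
    redundancy: float  # P(step adds nothing new), in [0, 1]


def solution_scores(steps: List[StepScore]) -> dict:
    # A single invalid step invalidates the solution, so take the minimum
    # validity; a single redundant step is enough to flag redundancy.
    return {
        "validity": min(s.validity for s in steps),
        "redundancy": max(s.redundancy for s in steps),
    }


print(solution_scores([StepScore(0.95, 0.1), StepScore(0.40, 0.2)]))
# {'validity': 0.4, 'redundancy': 0.2}
```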
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
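A minimal sketch of such a rule-penalised reward is shown below, with the QE score supplied externally. The concrete rules (empty output and extreme length ratios as proxies for incorrect translations) and the fixed penalty are illustrative assumptions rather than the rules used in the paper.

```python
# Minimal sketch of a rule-penalised QE reward. The QE score is assumed to come
# from an external quality estimation model; the rules below are stand-ins for
# the paper's incorrect-translation detectors.
def penalized_reward(source: str, hypothesis: str, qe_score: float,
                     penalty: float = 1.0) -> float:
    src_len = max(1, len(source.split()))
    hyp_len = len(hypothesis.split())
    looks_incorrect = (
        hyp_len == 0                 # empty or whitespace-only output
        or hyp_len > 3 * src_len     # heavy over-generation
        or 3 * hyp_len < src_len     # heavy under-generation / truncation
    )
    return qe_score - penalty if looks_incorrect else qe_score


print(penalized_reward("Das ist ein Test .", "", qe_score=0.75, penalty=0.5))  # 0.25
```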
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
- Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study [14.507068647009602]
Large Language Models (LLMs) have been widely employed in programming language analysis to enhance human productivity.
Their reliability can be compromised by various code distribution shifts, leading to inconsistent outputs.
Probabilistic methods are known to mitigate such impact through uncertainty calibration and estimation.
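One standard probability-based diagnostic in this space is expected calibration error (ECE), sketched below over equal-width confidence bins; the binning scheme is the common textbook variant, not a detail taken from the benchmark itself.

```python
# Expected calibration error over equal-width confidence bins: the weighted gap
# between average confidence and empirical accuracy in each bin.
import numpy as np


def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


conf = np.array([0.9, 0.8, 0.95, 0.6])
hit = np.array([1, 0, 1, 1])
print(expected_calibration_error(conf, hit))  # ~0.34 for this toy example
```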
arXiv Detail & Related papers (2024-01-12T00:00:32Z)
- Applying Bayesian Data Analysis for Causal Inference about Requirements Quality: A Controlled Experiment [4.6068376339651635]
It is commonly accepted that the quality of requirements specifications impacts subsequent software engineering activities.
We aim to contribute empirical evidence on the effect that requirements quality defects have on a software engineering activity.
arXiv Detail & Related papers (2024-01-02T11:08:39Z)
- Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model [77.19693792957614]
We propose to make neural machine translation (NMT) models quality-aware by training them to estimate the quality of their own output.
We obtain quality gains similar to or even superior to those of quality-reranking approaches, but with the efficiency of single-pass decoding.
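One way to realise such quality awareness, sketched below purely as an assumption, is to bucket a sentence-level quality score into a special token attached to the training target, so that a single decoding pass yields both a translation and a coarse quality estimate. The token scheme is illustrative and not necessarily the mechanism used in the paper.

```python
# Illustrative quality-token scheme: bucket a sentence-level quality score in
# [0, 1] into a discrete token prepended to the training target.
def attach_quality_token(target: str, quality: float, n_buckets: int = 5) -> str:
    bucket = min(int(quality * n_buckets), n_buckets - 1)
    return f"<q{bucket}> {target}"


print(attach_quality_token("Das ist ein Test.", 0.87))  # "<q4> Das ist ein Test."
```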
arXiv Detail & Related papers (2023-10-10T15:33:51Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Information-Theoretic Testing and Debugging of Fairness Defects in Deep Neural Networks [13.425444923812586]
Deep feedforward neural networks (DNNs) are increasingly deployed in socioeconomically critical decision-support software systems.
We present DICE: an information-theoretic testing and debug framework to discover and localize fairness defects in DNNs.
We show that DICE efficiently characterizes the amounts of discrimination, effectively generates discriminatory instances, and localizes layers/neurons with significant biases.
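The basic search underlying fairness testing of this kind can be sketched as follows: perturb only the protected attribute of an input and flag the pair as discriminatory if the model's decision changes. DICE's information-theoretic quantification and gradient-guided search are not reproduced here; the function below is a plain baseline check.

```python
# Baseline individual-fairness check: two inputs that differ only in the
# protected feature should receive the same predicted class.
import torch


def find_discriminatory_instance(model, x: torch.Tensor,
                                 protected_idx: int,
                                 protected_values=(0.0, 1.0)):
    """Return (x_a, x_b) if flipping the protected feature changes the
    model's predicted class, else None."""
    x_a, x_b = x.clone(), x.clone()
    x_a[protected_idx], x_b[protected_idx] = protected_values
    with torch.no_grad():
        y_a = model(x_a.unsqueeze(0)).argmax(dim=-1)
        y_b = model(x_b.unsqueeze(0)).argmax(dim=-1)
    return (x_a, x_b) if (y_a != y_b).item() else None
```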
arXiv Detail & Related papers (2023-04-09T09:16:27Z)
- Benchopt: Reproducible, efficient and collaborative optimization benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z)
- DeepZensols: Deep Natural Language Processing Framework [23.56171046067646]
This work presents a framework designed to produce consistent, reproducible results.
It provides a means of easily creating, training, and evaluating natural language processing (NLP) deep learning (DL) models.
arXiv Detail & Related papers (2021-09-08T01:16:05Z)