When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
- URL: http://arxiv.org/abs/2303.16166v5
- Date: Thu, 4 Jul 2024 09:16:22 GMT
- Title: When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
- Authors: Sara Papi, Marco Gaido, Andrea Pilzer, Matteo Negri
- Abstract summary: We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture.
We propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models.
- Score: 23.30735117217225
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite its crucial role in research experiments, code correctness is often presumed only on the basis of the perceived quality of results. This assumption comes with the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on reproducibility should go hand in hand with the emphasis on software quality. We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results, which however can lead to incorrect conclusions that potentially misguide future research. As a countermeasure, we propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models, with the goal of promoting coding best practices and improving research software quality within the NLP community.
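As an illustration of the kind of behavioural test that such a checklist and testing library encourage, below is a minimal, self-contained sketch of a padding-invariance check for a lengths-aware encoder. The `TinyEncoder` module and its masking logic are illustrative stand-ins, not the pangoliNN API or the audited Conformer implementations; the point is only that appending padding frames should never change the output for the real frames.

```python
# Minimal sketch of a padding-invariance test for a lengths-aware encoder.
# TinyEncoder and its masking are illustrative stand-ins, not pangoliNN code.
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Toy encoder that masks padded frames before and after a depthwise conv."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim). Zero padded frames so they cannot leak into
        # real frames through the convolution's receptive field.
        mask = (torch.arange(x.size(1))[None, :] < lengths[:, None]).to(x.dtype)
        x = x * mask.unsqueeze(-1)
        out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return out * mask.unsqueeze(-1)


def test_padding_invariance():
    torch.manual_seed(0)
    model = TinyEncoder().eval()
    x, lengths = torch.randn(1, 10, 8), torch.tensor([10])
    # Same utterance, but with 5 extra padding frames appended (as happens
    # when batching sequences of different lengths).
    x_padded = torch.cat([x, torch.zeros(1, 5, 8)], dim=1)
    with torch.no_grad():
        out_ref = model(x, lengths)
        out_padded = model(x_padded, lengths)[:, :10]
    assert torch.allclose(out_ref, out_padded, atol=1e-6), \
        "encoder output changes with the amount of padding"


test_padding_invariance()
```

A test of this kind fails on implementations where padded positions leak into real positions through convolution or normalization layers, regardless of how good the downstream WER or BLEU scores look.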
Related papers
- Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course [1.553083901660282]
Testing plays an important role in securing the success of a software development project.
We investigate whether we can quantify the effects various types of testing have on functional suitability.
arXiv Detail & Related papers (2024-08-22T04:23:51Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% on our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
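A hedged, provider-agnostic sketch of the core idea follows: an LLM is prompted to link the error message of a failing test to the code change most likely responsible. The prompt format and the generic `llm` callable are assumptions for illustration, not the prompts or models used at EA.

```python
# Illustrative sketch: ask an LLM which candidate code change most plausibly
# caused a test failure, given the error message. Provider-agnostic.
from typing import Callable, Sequence


def rank_suspect_changes(error_message: str,
                         diffs: Sequence[str],
                         llm: Callable[[str], str]) -> str:
    """Return the model's verdict on which numbered diff caused the failure."""
    numbered = "\n\n".join(f"[{i}] {diff}" for i, diff in enumerate(diffs))
    prompt = (
        "A test failed with the following error message:\n"
        f"{error_message}\n\n"
        "Candidate code changes since the last passing run:\n"
        f"{numbered}\n\n"
        "Reply with the index of the change most likely responsible, "
        "followed by a one-sentence justification."
    )
    return llm(prompt)
```

Any chat or completion client can be wrapped as the `llm` callable; the returned index can then be parsed and surfaced to the developer.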
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.
We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets.
We observe that ReasonEval can play a significant role in data selection.
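A minimal sketch of what step-level evaluation can look like is given below, assuming a per-step scorer that outputs validity and redundancy probabilities. The `StepScore` structure and the min/max aggregation are illustrative assumptions, not ReasonEval's released models or its exact aggregation scheme.

```python
# Sketch of step-level reasoning evaluation: each step gets a validity and a
# redundancy score, and solution-level scores aggregate over steps.
from dataclasses import dataclass
from typing import List


@dataclass
class StepScore:
    validity: float    # P(step is logically correct), in [0, 1]
    redundancy: float  # P(step adds nothing new), in [0, 1]


def solution_scores(steps: List[StepScore]) -> dict:
    # A single invalid step invalidates the solution, so take the minimum
    # validity; a single redundant step is enough to flag redundancy.
    return {
        "validity": min(s.validity for s in steps),
        "redundancy": max(s.redundancy for s in steps),
    }


print(solution_scores([StepScore(0.95, 0.1), StepScore(0.40, 0.2)]))
# {'validity': 0.4, 'redundancy': 0.2}
```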
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect incorrect translations and assigns a penalty term to their reward scores.
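A minimal sketch of such a rule-penalised reward is shown below, with the QE score supplied externally. The concrete rules (empty output and extreme length ratios as proxies for incorrect translations) and the fixed penalty are illustrative assumptions rather than the rules used in the paper.

```python
# Minimal sketch of a rule-penalised QE reward. The QE score is assumed to come
# from an external quality estimation model; the rules below are stand-ins for
# the paper's incorrect-translation detectors.
def penalized_reward(source: str, hypothesis: str, qe_score: float,
                     penalty: float = 1.0) -> float:
    src_len = max(1, len(source.split()))
    hyp_len = len(hypothesis.split())
    looks_incorrect = (
        hyp_len == 0                 # empty or whitespace-only output
        or hyp_len > 3 * src_len     # heavy over-generation
        or 3 * hyp_len < src_len     # heavy under-generation / truncation
    )
    return qe_score - penalty if looks_incorrect else qe_score


print(penalized_reward("Das ist ein Test .", "", qe_score=0.75, penalty=0.5))  # 0.25
```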
arXiv Detail & Related papers (2024-01-23T16:07:43Z)
- Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study [14.507068647009602]
Large Language Models (LLMs) have been widely employed in programming language analysis to enhance human productivity.
Their reliability can be compromised by various code distribution shifts, leading to inconsistent outputs.
Probabilistic methods are known to mitigate such impact through uncertainty calibration and estimation.
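One standard probability-based diagnostic in this space is expected calibration error (ECE), sketched below over equal-width confidence bins; the binning scheme is the common textbook variant, not a detail taken from the benchmark itself.

```python
# Expected calibration error over equal-width confidence bins: the weighted gap
# between average confidence and empirical accuracy in each bin.
import numpy as np


def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


conf = np.array([0.9, 0.8, 0.95, 0.6])
hit = np.array([1, 0, 1, 1])
print(expected_calibration_error(conf, hit))  # ~0.34 for this toy example
```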
arXiv Detail & Related papers (2024-01-12T00:00:32Z)
- Applying Bayesian Data Analysis for Causal Inference about Requirements Quality: A Controlled Experiment [4.6068376339651635]
It is commonly accepted that the quality of requirements specifications impacts subsequent software engineering activities.
We aim to contribute empirical evidence on the effect that requirements quality defects have on a software engineering activity.
arXiv Detail & Related papers (2024-01-02T11:08:39Z)
- Quality-Aware Translation Models: Efficient Generation and Quality Estimation in a Single Model [77.19693792957614]
We propose to make neural machine translation (NMT) models quality-aware by training them to estimate the quality of their own output.
We obtain quality gains similar to or even superior to those of quality-reranking approaches, but with the efficiency of single-pass decoding.
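One way to realise such quality awareness, sketched below purely as an assumption, is to bucket a sentence-level quality score into a special token attached to the training target, so that a single decoding pass yields both a translation and a coarse quality estimate. The token scheme is illustrative and not necessarily the mechanism used in the paper.

```python
# Illustrative quality-token scheme: bucket a sentence-level quality score in
# [0, 1] into a discrete token prepended to the training target.
def attach_quality_token(target: str, quality: float, n_buckets: int = 5) -> str:
    bucket = min(int(quality * n_buckets), n_buckets - 1)
    return f"<q{bucket}> {target}"


print(attach_quality_token("Das ist ein Test.", 0.87))  # "<q4> Das ist ein Test."
```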
arXiv Detail & Related papers (2023-10-10T15:33:51Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Information-Theoretic Testing and Debugging of Fairness Defects in Deep Neural Networks [13.425444923812586]
Deep feedforward neural networks (DNNs) are increasingly deployed in socioeconomically critical decision-support software systems.
We present DICE: an information-theoretic testing and debug framework to discover and localize fairness defects in DNNs.
We show that DICE efficiently characterizes the amounts of discrimination, effectively generates discriminatory instances, and localizes layers/neurons with significant biases.
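The basic search underlying fairness testing of this kind can be sketched as follows: perturb only the protected attribute of an input and flag the pair as discriminatory if the model's decision changes. DICE's information-theoretic quantification and gradient-guided search are not reproduced here; the function below is a plain baseline check.

```python
# Baseline individual-fairness check: two inputs that differ only in the
# protected feature should receive the same predicted class.
import torch


def find_discriminatory_instance(model, x: torch.Tensor,
                                 protected_idx: int,
                                 protected_values=(0.0, 1.0)):
    """Return (x_a, x_b) if flipping the protected feature changes the
    model's predicted class, else None."""
    x_a, x_b = x.clone(), x.clone()
    x_a[protected_idx], x_b[protected_idx] = protected_values
    with torch.no_grad():
        y_a = model(x_a.unsqueeze(0)).argmax(dim=-1)
        y_b = model(x_b.unsqueeze(0)).argmax(dim=-1)
    return (x_a, x_b) if (y_a != y_b).item() else None
```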
arXiv Detail & Related papers (2023-04-09T09:16:27Z)
- Benchopt: Reproducible, efficient and collaborative optimization benchmarks [67.29240500171532]
Benchopt is a framework to automate, reproduce and publish optimization benchmarks in machine learning.
Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments.
arXiv Detail & Related papers (2022-06-27T16:19:24Z)
- DeepZensols: Deep Natural Language Processing Framework [23.56171046067646]
This work presents a framework designed to produce consistent, reproducible results.
It provides a means of easily creating, training, and evaluating natural language processing (NLP) deep learning (DL) models.
arXiv Detail & Related papers (2021-09-08T01:16:05Z)