Reliability Testing for Natural Language Processing Systems
- URL: http://arxiv.org/abs/2105.02590v1
- Date: Thu, 6 May 2021 11:24:58 GMT
- Title: Reliability Testing for Natural Language Processing Systems
- Authors: Samson Tan, Shafiq Joty, Kathy Baxter, Araz Taeihagh, Gregory A. Bennett, Min-Yen Kan
- Abstract summary: We argue for the need for reliability testing and contextualize it among existing work on improving accountability.
We show how adversarial attacks can be reframed for this goal, via a framework for developing reliability tests.
- Score: 14.393308846231083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Questions of fairness, robustness, and transparency are paramount to address
before deploying NLP systems. Central to these concerns is the question of
reliability: Can NLP systems reliably treat different demographics fairly and
function correctly in diverse and noisy environments? To address this, we argue
for the need for reliability testing and contextualize it among existing work
on improving accountability. We show how adversarial attacks can be reframed
for this goal, via a framework for developing reliability tests. We argue that
reliability testing -- with an emphasis on interdisciplinary collaboration --
will enable rigorous and targeted testing, and aid in the enactment and
enforcement of industry standards.
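To make the reframing concrete, below is a minimal sketch of one such reliability test: an invariance check asserting that a classifier's prediction should not change when a demographic name in the input is swapped. Everything here (the template, names, and toy classifier) is illustrative, not code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ReliabilityCase:
    original: str   # base input
    perturbed: str  # same input with one demographic attribute swapped

def name_swap_cases(template: str, names: List[str]) -> List[ReliabilityCase]:
    """Instantiate a template with different names; a reliable system
    should assign the same label to every instantiation."""
    base = template.format(name=names[0])
    return [ReliabilityCase(base, template.format(name=n)) for n in names[1:]]

def invariance_failure_rate(cases: List[ReliabilityCase],
                            predict: Callable[[str], str]) -> float:
    """Fraction of perturbed inputs whose predicted label differs
    from the label of the corresponding original input."""
    failures = sum(predict(c.original) != predict(c.perturbed) for c in cases)
    return failures / len(cases)

# Stand-in classifier for illustration; substitute a real NLP model here.
def toy_predict(text: str) -> str:
    return "positive" if "great" in text.lower() else "negative"

cases = name_swap_cases("{name} said the service was great.",
                        ["Alice", "Jamal", "Ming", "Priya"])
print(f"invariance failure rate: {invariance_failure_rate(cases, toy_predict):.2f}")
```

A full test suite in the paper's sense would pair such perturbations with explicit reliability requirements (e.g., a maximum tolerated failure rate) agreed with domain experts.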
Related papers
- FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
arXiv Detail & Related papers (2024-09-30T06:27:53Z)
- Trustworthiness for an Ultra-Wideband Localization Service [2.4979362117484714]
This paper proposes a holistic trustworthiness assessment framework for ultra-wideband self-localization.
Our goal is to provide guidance for evaluating a system's trustworthiness based on objective evidence.
Our approach guarantees that the resulting trustworthiness indicators correspond to chosen real-world threats.
arXiv Detail & Related papers (2024-08-10T11:57:10Z)
- Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models [14.5291643644017]
We introduce the concept of Confidence-Probability Alignment.
We probe the alignment between models' internal and expressed confidence.
Among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment.
arXiv Detail & Related papers (2024-05-25T15:42:04Z)
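One concrete way to probe such confidence-probability alignment, sketched below with simulated data (our illustration, not the paper's protocol): for each question, pair the probability the model assigned to its chosen answer with the confidence it verbalizes when asked, then measure their rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr

# Simulated probe data; a real experiment would collect both values from an LLM:
# `internal` = probability the decoder assigned to its chosen answer,
# `stated`   = confidence the model verbalizes ("How confident are you, 0-100?").
rng = np.random.default_rng(42)
internal = rng.uniform(0.2, 1.0, size=200)
stated = np.clip(100 * internal + rng.normal(0, 15, size=200), 0, 100)

rho, p = spearmanr(internal, stated)
print(f"confidence-probability alignment (Spearman rho): {rho:.2f}, p = {p:.1e}")
```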
- When to Trust LLMs: Aligning Confidence with Response Quality [49.371218210305656]
We propose the CONfidence-Quality-ORDer-preserving alignment approach (CONQORD).
It integrates quality reward and order-preserving alignment reward functions.
Experiments demonstrate that CONQORD significantly improves the alignment performance between confidence and response accuracy.
arXiv Detail & Related papers (2024-04-26T09:42:46Z)
- TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z)
- A Holistic Assessment of the Reliability of Machine Learning Systems [30.638615396429536]
This paper proposes a holistic assessment methodology for the reliability of machine learning (ML) systems.
Our framework evaluates five key properties: in-distribution accuracy, distribution-shift robustness, adversarial robustness, calibration, and out-of-distribution detection.
To provide insights into the performance of different algorithmic approaches, we identify and categorize state-of-the-art techniques.
arXiv Detail & Related papers (2023-07-20T05:00:13Z)
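Of the five properties, calibration is the simplest to make concrete. Below is a minimal sketch of expected calibration error (ECE), one standard calibration metric; it is our illustration, and the paper's exact metric choices may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin predictions by confidence; ECE is the bin-size-weighted average
    gap between each bin's mean confidence and its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Synthetic overconfident predictor: claims 0.9 but is right ~70% of the time.
rng = np.random.default_rng(1)
conf = np.full(1000, 0.9)
correct = rng.random(1000) < 0.7
print(f"ECE: {expected_calibration_error(conf, correct):.3f}")  # around 0.2
```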
- Did You Mean...? Confidence-based Trade-offs in Semantic Parsing [52.28988386710333]
We show how a calibrated model can help balance common trade-offs in task-oriented parsing.
We then examine how confidence scores can help optimize the trade-off between usability and safety.
arXiv Detail & Related papers (2023-03-29T17:07:26Z)
- Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z)
- Demonstrating Software Reliability using Possibly Correlated Tests: Insights from a Conservative Bayesian Approach [2.152298082788376]
We formalise informal notions of "doubting" that the executions are independent.
We develop techniques that reveal the extent to which independence assumptions can undermine conservatism in assessments.
arXiv Detail & Related papers (2022-08-16T20:27:47Z)
- An evaluation of word-level confidence estimation for end-to-end automatic speech recognition [70.61280174637913]
We investigate confidence estimation for end-to-end automatic speech recognition (ASR).
We provide an extensive benchmark of popular confidence methods on four well-known speech datasets.
Our results suggest a strong baseline can be obtained by scaling the logits by a learnt temperature.
arXiv Detail & Related papers (2021-01-14T09:51:59Z)
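The temperature-scaling baseline from the last entry is easy to reproduce: fit a single scalar T on held-out logits by minimizing negative log-likelihood, then use the rescaled softmax probabilities as confidence scores. The sketch below uses synthetic data and is our illustration, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_with_temperature(T: float, logits: np.ndarray, labels: np.ndarray) -> float:
    """Average negative log-likelihood of the true labels after dividing logits by T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Learn the temperature that minimizes NLL on a held-out set."""
    result = minimize_scalar(nll_with_temperature, bounds=(0.05, 10.0),
                             args=(logits, labels), method="bounded")
    return result.x

# Synthetic held-out set: 1000 examples, 5 classes, true-class logits boosted.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
logits = rng.normal(size=(1000, 5))
logits[np.arange(1000), labels] += 3.0
T = fit_temperature(logits, labels)
print(f"learnt temperature: {T:.2f}")
```

For word-level ASR confidence, the same fit would be applied to per-word (or per-token) logits before reading off softmax probabilities.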
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.