Physician Detection of Clinical Harm in Machine Translation: Quality
Estimation Aids in Reliance and Backtranslation Identifies Critical Errors
- URL: http://arxiv.org/abs/2310.16924v1
- Date: Wed, 25 Oct 2023 18:44:14 GMT
- Title: Physician Detection of Clinical Harm in Machine Translation: Quality
Estimation Aids in Reliance and Backtranslation Identifies Critical Errors
- Authors: Nikita Mehandru, Sweta Agrawal, Yimin Xiao, Elaine C Khoong, Ge Gao,
Marine Carpuat, Niloufar Salehi
- Abstract summary: This paper evaluates quality estimation feedback in vivo with a human study simulating decision-making in high-stakes medical settings.
We find that quality estimation improves appropriate reliance on MT, but backtranslation helps physicians detect more clinically harmful errors that QE alone often misses.
- Score: 27.13497855061732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major challenge in the practical use of Machine Translation (MT) is that
users lack guidance to make informed decisions about when to rely on outputs.
Progress in quality estimation research provides techniques to automatically
assess MT quality, but these techniques have primarily been evaluated in vitro
by comparison against human judgments outside of a specific context of use.
This paper evaluates quality estimation feedback in vivo with a human study
simulating decision-making in high-stakes medical settings. Using Emergency
Department discharge instructions, we study how interventions based on quality
estimation versus backtranslation assist physicians in deciding whether to show
MT outputs to a patient. We find that quality estimation improves appropriate
reliance on MT, but backtranslation helps physicians detect more clinically
harmful errors that QE alone often misses.
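As a concrete illustration of the two interventions, here is a hypothetical sketch of how a QE score and a backtranslation might be presented together before a physician decides whether to share the output. The `qe_score` and `backtranslate` stubs and the 0.7 threshold are illustrative assumptions, not the study's actual models or interface.

```python
def qe_score(source: str, translation: str) -> float:
    """Stand-in for a quality-estimation model scoring (source, MT) in [0, 1]."""
    return 0.42  # stub value for illustration only

def backtranslate(translation: str) -> str:
    """Stand-in for machine-translating the MT output back into English."""
    return "Take one tablet two times per day."  # stub output for illustration

def review_card(source: str, translation: str, threshold: float = 0.7) -> dict:
    """Bundle the feedback a physician sees before sharing MT output."""
    score = qe_score(source, translation)
    return {
        "mt_output": translation,
        "qe_score": score,
        "qe_advice": "likely reliable" if score >= threshold else "review carefully",
        # Reading the backtranslation is what surfaces meaning-level errors
        # that a single QE number can miss.
        "backtranslation": backtranslate(translation),
    }

print(review_card(
    source="Take one tablet twice daily.",
    translation="Tome una tableta dos veces al día.",
))
```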
Related papers
- Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation [2.7379431425414693]
This paper explores the potential of using Large Language Models (LLMs) to automate the evaluation of responses in medical Question and Answer (Q&A) systems.
arXiv Detail & Related papers (2024-09-03T14:38:29Z)
- Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA [24.10436440624249]
Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA).
This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions.
arXiv Detail & Related papers (2024-05-30T18:56:01Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
GPT-4V is found to excel at understanding medical images and to generate high-quality radiology reports, but its performance on medical visual grounding needs substantial improvement.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality? [61.866103154161884]
Neural machine translation (NMT) is often criticized for failures that happen without awareness.
We propose a novel competency-aware NMT by extending conventional NMT with a self-estimator (see the toy sketch after this list).
We show that the proposed method delivers outstanding performance on quality estimation.
arXiv Detail & Related papers (2022-11-25T02:39:41Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation [0.0]
In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output.
It covers only a limited number of commonly occurring error types and uses a scoring model in which error penalty points (EPPs) grow geometrically with error severity, applied to each translation unit (see the sketch after this list).
The approach has several key advantages: it can measure and compare less-than-perfect MT output from different systems, indicate human perception of quality, and immediately estimate the labor effort required to bring MT output to premium quality, and it is cheaper and faster to apply while yielding higher inter-rater reliability (IRR).
arXiv Detail & Related papers (2021-12-27T18:47:43Z)
- Measuring Uncertainty in Translation Quality Evaluation (TQE) [62.997667081978825]
This work investigates how to correctly estimate confidence intervals (Brown et al., 2001) depending on the sample size of the translated text.
The methodology draws on Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA); a worked example follows after this list.
arXiv Detail & Related papers (2021-11-15T12:09:08Z)
- Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation [19.116396693370422]
We propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework.
We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs.
We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers.
arXiv Detail & Related papers (2021-04-29T16:42:09Z)
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
- A Set of Recommendations for Assessing Human-Machine Parity in Language Translation [87.72302201375847]
We reassess Hassan et al.'s investigation into Chinese to English news translation.
We show that the professional human translations contained significantly fewer errors.
arXiv Detail & Related papers (2020-04-03T17:49:56Z)
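For the competency-aware NMT entry above, here is a toy sketch of what a self-estimator might look like: a small head that pools decoder hidden states into a per-sentence confidence score. The pooling, layer sizes, and training objective are assumptions for illustration; the paper's actual design is not described in the summary.

```python
import torch
import torch.nn as nn

class SelfEstimator(nn.Module):
    """Toy self-estimator head producing one confidence score per sentence."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # confidence in [0, 1]
        )

    def forward(self, decoder_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, seq_len, hidden); mask: (batch, seq_len) of 0/1.
        # Mean-pool over real (unmasked) tokens, then score the pooled vector.
        pooled = (decoder_states * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.scorer(pooled).squeeze(-1)  # (batch,) confidence scores

# Dummy usage: two "sentences" of 10 decoder states each.
head = SelfEstimator(hidden_dim=512)
scores = head(torch.randn(2, 10, 512), torch.ones(2, 10))
print(scores)
```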
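For the HOPE entry, a minimal sketch of a geometric error-penalty scheme, assuming four severity levels and a doubling base; the framework's real error taxonomy and penalty values are not given in the summary above.

```python
# Severity level -> error penalty points (EPPs), doubling per level
# (geometric progression with base 2; the exact HOPE values are an assumption).
SEVERITY_EPP = {level: 2 ** (level - 1) for level in (1, 2, 3, 4)}
# level 1 -> 1, level 2 -> 2, level 3 -> 4, level 4 -> 8

def unit_score(error_severities: list[int]) -> int:
    """Total EPPs for one translation unit, given its errors' severity levels."""
    return sum(SEVERITY_EPP[s] for s in error_severities)

# A unit with one minor (level 1) and one critical (level 4) error:
print(unit_score([1, 4]))  # -> 9 EPPs; lower totals mean closer to premium quality
```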
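For the TQE entry, a worked example of the sample-size question: treat each translated segment as a Bernoulli trial (error / no error) and estimate, by Monte Carlo sampling, a 95% interval for the observed error rate. The true error rate of 0.1 is an illustrative assumption, not a figure from the paper.

```python
import random

def mc_interval(p: float, n: int, draws: int = 5000) -> tuple[float, float]:
    """95% Monte Carlo interval for the observed error rate over n segments,
    each treated as an independent Bernoulli(p) trial."""
    rates = sorted(
        sum(random.random() < p for _ in range(n)) / n for _ in range(draws)
    )
    return rates[int(0.025 * draws)], rates[int(0.975 * draws)]

random.seed(0)
for n in (50, 200, 1000):
    lo, hi = mc_interval(p=0.1, n=n)
    print(f"n={n:4d}  95% interval: [{lo:.3f}, {hi:.3f}]")
# The interval narrows roughly as 1/sqrt(n): certifying a quality level
# requires enough translated text, which is the sample-size point above.
```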
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.