Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics
- URL: http://arxiv.org/abs/2501.17187v1
- Date: Sun, 26 Jan 2025 17:14:51 GMT
- Title: Visualizing Uncertainty in Translation Tasks: An Evaluation of LLM Performance and Confidence Metrics
- Authors: Jin Hyun Park, Utsawb Laminchhane, Umer Farooq, Uma Sivakumar, Arpan Kumar
- Abstract summary: Large language models (LLMs) are increasingly utilized for machine translation, yet their predictions often exhibit uncertainties that hinder interpretability and user trust.
This paper addresses two primary objectives: (1) providing users with token-level insights into model confidence and (2) developing a web-based visualization tool to quantify and represent translation uncertainties.
- Abstract: Large language models (LLMs) are increasingly utilized for machine translation, yet their predictions often exhibit uncertainties that hinder interpretability and user trust. Effectively visualizing these uncertainties can enhance the usability of LLM outputs, particularly in contexts where translation accuracy is critical. This paper addresses two primary objectives: (1) providing users with token-level insights into model confidence and (2) developing a web-based visualization tool to quantify and represent translation uncertainties. To achieve these goals, we utilized the T5 model with the WMT19 dataset for translation tasks and evaluated translation quality using established metrics such as BLEU, METEOR, and ROUGE. We introduced three novel uncertainty quantification (UQ) metrics: (1) the geometric mean of token probabilities, (2) the arithmetic mean of token probabilities, and (3) the arithmetic mean of the kurtosis of token distributions. These metrics provide a simple yet effective framework for evaluating translation performance. Our analysis revealed a linear relationship between the traditional evaluation metrics and our UQ metrics, demonstrating the validity of our approach. Additionally, we developed an interactive web-based visualization that uses a color gradient to represent token confidence. This tool offers users a clear and intuitive understanding of translation quality while providing valuable insights into model performance. Overall, we show that our UQ metrics and visualization are both robust and interpretable, offering practical tools for evaluating and assessing machine translation systems.
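The three UQ metrics are simple aggregates over the decoder's per-token outputs. Below is a minimal Python sketch of how they could be computed, assuming access to the probability of each generated token and, for the kurtosis metric, the full softmax distribution at each decoding step; the function names, the use of SciPy's Fisher kurtosis, and the red-to-green hex gradient for the visualization are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of the three UQ metrics and a color-gradient mapping as
# described in the abstract. Names and the SciPy kurtosis choice are
# illustrative assumptions, not the authors' implementation.
import math
from statistics import mean

from scipy.stats import kurtosis  # Fisher (excess) kurtosis by default


def geometric_mean_probability(token_probs):
    """UQ metric 1: geometric mean of the generated tokens' probabilities."""
    return math.exp(mean(math.log(p) for p in token_probs))


def arithmetic_mean_probability(token_probs):
    """UQ metric 2: arithmetic mean of the generated tokens' probabilities."""
    return mean(token_probs)


def mean_token_kurtosis(token_distributions):
    """UQ metric 3: arithmetic mean of the kurtosis of each step's softmax
    distribution. A sharply peaked (confident) distribution yields high
    kurtosis; a flat (uncertain) one yields low kurtosis."""
    return mean(float(kurtosis(dist)) for dist in token_distributions)


def confidence_to_color(p):
    """Map a token probability to a red-to-green hex color for the
    web visualization (low confidence = red, high confidence = green)."""
    return f"#{int(255 * (1 - p)):02x}{int(255 * p):02x}00"


if __name__ == "__main__":
    # Hypothetical probabilities of four generated target tokens.
    probs = [0.91, 0.75, 0.98, 0.60]
    # Hypothetical softmax distributions over a toy five-token vocabulary.
    dists = [
        [0.91, 0.04, 0.03, 0.01, 0.01],
        [0.75, 0.10, 0.08, 0.04, 0.03],
        [0.98, 0.01, 0.005, 0.003, 0.002],
        [0.60, 0.20, 0.10, 0.06, 0.04],
    ]
    print(geometric_mean_probability(probs))      # ~0.80
    print(arithmetic_mean_probability(probs))     # 0.81
    print(mean_token_kurtosis(dists))             # higher = more peaked
    print([confidence_to_color(p) for p in probs])
```

Under this reading, a sentence-level score near 1.0 indicates a confidently generated translation, and the per-token colors can be attached directly to tokens in the web-based visualization.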
Related papers
- Advancing Explainability in Neural Machine Translation: Analytical Metrics for Attention and Alignment Consistency [2.4022340214033915]
We introduce a systematic framework to quantitatively evaluate the explainability of an NMT model's attention patterns.
We present a set of metrics, attention entropy and alignment agreement, and validate them on an English-German test subset from WMT14.
Our results indicate that sharper attention distributions correlate with improved interpretability but do not always guarantee better translation quality.
arXiv Detail & Related papers (2024-12-24T20:08:33Z)
- Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation [64.5862977630713]
This study investigates how Large Language Models (LLMs) leverage source and reference data in the machine translation evaluation task.
We find that reference information significantly enhances evaluation accuracy, while, surprisingly, source information is sometimes counterproductive.
arXiv Detail & Related papers (2024-01-12T13:23:21Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training [64.37683359609308]
In this study, we analyze various mainstream and cutting-edge automatic metrics from the perspective of their guidance for training machine translation systems.
We find that certain metrics exhibit robustness defects, such as the presence of universal adversarial translations in BLEURT and BARTScore.
In-depth analysis suggests two main causes of these robustness deficits: distribution biases in the training datasets, and the tendency of the metric paradigm.
arXiv Detail & Related papers (2023-07-06T16:59:30Z)
- Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation [160.07938471250048]
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.
We develop strong-performing automatic metrics for reference-based summarization evaluation.
arXiv Detail & Related papers (2023-03-07T02:49:50Z)
- A Fine-grained Interpretability Evaluation Benchmark for Neural NLP [44.08113828762984]
This benchmark covers three representative NLP tasks: sentiment analysis, textual similarity and reading comprehension.
We provide token-level rationales that are carefully annotated to be sufficient, compact and comprehensive.
We conduct experiments on three typical models with three saliency methods, and unveil their strengths and weaknesses in terms of interpretability.
arXiv Detail & Related papers (2022-05-23T07:37:04Z)
- PreQuEL: Quality Estimation of Machine Translation Outputs in Advance [32.922128367314194]
A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation.
We develop a baseline model for the task and analyze its performance.
We show that this augmentation method can improve the performance of the Quality-Estimation task as well.
arXiv Detail & Related papers (2022-05-18T18:55:05Z)
- A global analysis of metrics used for measuring performance in natural language processing [9.433496814327086]
We provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing.
Results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance.
arXiv Detail & Related papers (2022-04-25T11:41:50Z)
- Conditional Bilingual Mutual Information Based Adaptive Training for Neural Machine Translation [66.23055784400475]
Token-level adaptive training approaches can alleviate the token imbalance problem.
We propose a target-context-aware metric, named conditional bilingual mutual information (CBMI).
CBMI can be efficiently calculated during model training without any pre-specific statistical calculations.
arXiv Detail & Related papers (2022-03-06T12:34:10Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)