SUT: Active Defects Probing for Transcompiler Models
- URL: http://arxiv.org/abs/2310.14209v1
- Date: Sun, 22 Oct 2023 07:16:02 GMT
- Title: SUT: Active Defects Probing for Transcompiler Models
- Authors: Mengnan Qi, Yufan Huang, Maoquan Wang, Yongqiang Yao, Zihan Liu, Bin
Gu, Colin Clement, Neel Sundaresan
- Abstract summary: We introduce new metrics for programming language translation that address basic syntax errors.
Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests.
- Score: 24.01532199512389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic program translation has enormous application value and
hence has been attracting significant interest from AI researchers. However,
we observe that current program translation models still make elementary
syntax errors, particularly when the target language lacks syntax elements
present in the source language. Metrics like BLEU, CodeBLEU, and computational
accuracy may not expose these issues. In this paper we introduce new metrics
for programming language translation that address these basic syntax errors.
We develop a novel active defects probing suite called Syntactic Unit Tests
(SUT), which includes a highly interpretable evaluation harness for accuracy
and test scoring. Experiments show that even powerful models like ChatGPT
still make mistakes on these basic unit tests. Specifically, compared with
previous program translation evaluation datasets, the pass rate on our unit
tests drops by 26.15%. Furthermore, our evaluation harness reveals the
syntactic elements on which these models exhibit deficiencies.
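The abstract does not include the harness itself, so the following is a minimal illustrative sketch rather than the authors' released implementation: each syntactic unit test isolates one syntax element, executes the model's translated code, and pass rates are aggregated per element. The test structure, element names, and scoring below are assumptions.

```python
# Minimal sketch of a Syntactic Unit Test (SUT)-style harness, assuming the
# target language is Python and the model's output is a self-contained function.
# Test cases, element names, and the scoring scheme here are illustrative only.
from dataclasses import dataclass

@dataclass
class SyntacticUnitTest:
    element: str          # the syntactic element being probed (e.g. "ternary")
    translated_code: str  # code produced by the translation model
    entry_point: str      # function to call
    inputs: tuple         # arguments for the call
    expected: object      # expected return value

def run_test(test: SyntacticUnitTest) -> bool:
    """Execute the translated code in a fresh namespace and check its output."""
    namespace: dict = {}
    try:
        exec(test.translated_code, namespace)               # compile + define
        result = namespace[test.entry_point](*test.inputs)  # invoke entry point
        return result == test.expected
    except Exception:
        return False  # syntax or runtime errors count as failures

def per_element_accuracy(tests: list[SyntacticUnitTest]) -> dict[str, float]:
    """Aggregate pass rates by syntactic element for an interpretable report."""
    totals: dict[str, list[int]] = {}
    for t in tests:
        passed, seen = totals.setdefault(t.element, [0, 0])
        totals[t.element] = [passed + run_test(t), seen + 1]
    return {elem: p / n for elem, (p, n) in totals.items()}

if __name__ == "__main__":
    # Hypothetical model output translating a Java ternary expression to Python.
    tests = [SyntacticUnitTest(
        element="ternary",
        translated_code="def pick(a, b):\n    return a if a > b else b\n",
        entry_point="pick", inputs=(3, 5), expected=5,
    )]
    print(per_element_accuracy(tests))  # e.g. {'ternary': 1.0}
```

Reporting accuracy per syntactic element is what makes such a harness interpretable: a low score on, for example, ternary expressions points directly at the construct the model mishandles.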
Related papers
- xCOMET: Transparent Machine Translation Evaluation through Fine-grained
Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between machine translation evaluation approaches.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
arXiv Detail & Related papers (2023-10-16T15:03:14Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Toward Human-Like Evaluation for Natural Language Generation with Error Analysis [93.34894810865364]
Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors can produce high-quality human judgments.
This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis.
We augment BARTScore by incorporating the human-like error analysis strategies, namely BARTScore++, where the final score consists of both the evaluations of major errors and minor errors.
arXiv Detail & Related papers (2022-12-20T11:36:22Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics (a minimal sketch of this style of perturbation check appears after this list).
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Probing for targeted syntactic knowledge through grammatical error detection [13.653209309144593]
We propose grammatical error detection as a diagnostic probe to evaluate pre-trained English language models.
We leverage public annotated training data from both English second language learners and Wikipedia edits.
We find that masked language models linearly encode information relevant to the detection of subject-verb agreement (SVA) errors, while the autoregressive models perform on par with our baseline.
arXiv Detail & Related papers (2022-10-28T16:01:25Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and more complex, to the point that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
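For the perturbation-style check described in "On the Blind Spots of Model-Based Evaluation Metrics for Text Generation" above, the sketch below illustrates the idea under stated assumptions: it injects synthetic errors into a candidate text and checks whether the score drops accordingly. The error types are made up for illustration, and a simple surface-similarity function stands in for a learned metric.

```python
# Minimal sketch of a perturbation-style robustness check for a text
# generation metric: inject synthetic errors into a candidate and verify
# that the score drops by a commensurate amount. The error types and the
# stand-in surface metric below are illustrative, not the paper's setup.
import difflib
import random

def surface_metric(candidate: str, reference: str) -> float:
    """Stand-in similarity metric in [0, 1]; a learned metric would go here."""
    return difflib.SequenceMatcher(None, candidate.split(), reference.split()).ratio()

def drop_tokens(text: str, rate: float, seed: int = 0) -> str:
    """Synthetic error: randomly delete a fraction of the tokens."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [t for t in tokens if rng.random() > rate]
    return " ".join(kept) if kept else tokens[0]

def negate(text: str) -> str:
    """Synthetic error: flip the meaning by inserting a negation."""
    return text.replace(" is ", " is not ", 1)

if __name__ == "__main__":
    reference = "the translated function is equivalent to the source program"
    candidate = "the translated function is equivalent to the source program"
    clean = surface_metric(candidate, reference)
    for name, corrupted in [("token drop", drop_tokens(candidate, 0.3)),
                            ("negation", negate(candidate))]:
        score = surface_metric(corrupted, reference)
        # A robust metric should penalize the corrupted candidate noticeably;
        # a small or zero drop would indicate a blind spot.
        print(f"{name}: clean={clean:.2f} corrupted={score:.2f} drop={clean - score:.2f}")
```

In this toy run the negation edit barely moves the surface score even though it inverts the meaning, which is exactly the kind of blind spot such stress tests are designed to expose.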