SUT: Active Defects Probing for Transcompiler Models
- URL: http://arxiv.org/abs/2310.14209v1
- Date: Sun, 22 Oct 2023 07:16:02 GMT
- Title: SUT: Active Defects Probing for Transcompiler Models
- Authors: Mengnan Qi, Yufan Huang, Maoquan Wang, Yongqiang Yao, Zihan Liu, Bin
Gu, Colin Clement, Neel Sundaresan
- Abstract summary: We introduce new metrics for programming language translation that address basic syntax errors.
Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests.
- Score: 24.01532199512389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic program translation has enormous application value and
hence has been attracting significant interest from AI researchers. However,
we observe that current program translation models still make elementary
syntax errors, particularly when the target language lacks syntax elements
present in the source language. Metrics like BLEU, CodeBLEU, and computational
accuracy may not expose these issues. In this paper we introduce new metrics
for programming language translation that address these basic syntax errors.
We develop a novel active defects probing suite called Syntactic Unit Tests
(SUT), which includes a highly interpretable evaluation harness for accuracy
and test scoring. Experiments show that even powerful models like ChatGPT
still make mistakes on these basic unit tests. Specifically, compared with
previous program translation evaluation datasets, the pass rate on our unit
tests drops by 26.15%. Furthermore, our evaluation harness reveals the
syntactic elements on which these models exhibit deficiencies.
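The abstract does not include the harness itself, so the following is a minimal illustrative sketch rather than the authors' released implementation: each syntactic unit test isolates one syntax element, executes the model's translated code, and pass rates are aggregated per element. The test structure, element names, and scoring below are assumptions.

```python
# Minimal sketch of a Syntactic Unit Test (SUT)-style harness, assuming the
# target language is Python and the model's output is a self-contained function.
# Test cases, element names, and the scoring scheme here are illustrative only.
from dataclasses import dataclass

@dataclass
class SyntacticUnitTest:
    element: str          # the syntactic element being probed (e.g. "ternary")
    translated_code: str  # code produced by the translation model
    entry_point: str      # function to call
    inputs: tuple         # arguments for the call
    expected: object      # expected return value

def run_test(test: SyntacticUnitTest) -> bool:
    """Execute the translated code in a fresh namespace and check its output."""
    namespace: dict = {}
    try:
        exec(test.translated_code, namespace)               # compile + define
        result = namespace[test.entry_point](*test.inputs)  # invoke entry point
        return result == test.expected
    except Exception:
        return False  # syntax or runtime errors count as failures

def per_element_accuracy(tests: list[SyntacticUnitTest]) -> dict[str, float]:
    """Aggregate pass rates by syntactic element for an interpretable report."""
    totals: dict[str, list[int]] = {}
    for t in tests:
        passed, seen = totals.setdefault(t.element, [0, 0])
        totals[t.element] = [passed + run_test(t), seen + 1]
    return {elem: p / n for elem, (p, n) in totals.items()}

if __name__ == "__main__":
    # Hypothetical model output translating a Java ternary expression to Python.
    tests = [SyntacticUnitTest(
        element="ternary",
        translated_code="def pick(a, b):\n    return a if a > b else b\n",
        entry_point="pick", inputs=(3, 5), expected=5,
    )]
    print(per_element_accuracy(tests))  # e.g. {'ternary': 1.0}
```

Reporting accuracy per syntactic element is what makes such a harness interpretable: a low score on, for example, ternary expressions points directly at the construct the model mishandles.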
Related papers
- xCOMET: Transparent Machine Translation Evaluation through Fine-grained
Error Detection [21.116517555282314]
xCOMET is an open-source learned metric designed to bridge the gap between machine translation evaluation approaches.
It integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation.
We also provide a robustness analysis with stress tests, and show that xCOMET is largely capable of identifying localized critical errors and hallucinations.
arXiv Detail & Related papers (2023-10-16T15:03:14Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Toward Human-Like Evaluation for Natural Language Generation with Error Analysis [93.34894810865364]
Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors can produce high-quality human judgments.
This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis.
We augment BARTScore by incorporating the human-like error analysis strategies, namely BARTScore++, where the final score consists of both the evaluations of major errors and minor errors.
arXiv Detail & Related papers (2022-12-20T11:36:22Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics (a minimal sketch of this style of perturbation check appears after this list).
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Probing for targeted syntactic knowledge through grammatical error detection [13.653209309144593]
We propose grammatical error detection as a diagnostic probe to evaluate pre-trained English language models.
We leverage public annotated training data from both English second language learners and Wikipedia edits.
We find that masked language models linearly encode information relevant to the detection of subject-verb agreement (SVA) errors, while the autoregressive models perform on par with our baseline.
arXiv Detail & Related papers (2022-10-28T16:01:25Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and more complex, to the point that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
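For the perturbation-style check described in "On the Blind Spots of Model-Based Evaluation Metrics for Text Generation" above, the sketch below illustrates the idea under stated assumptions: it injects synthetic errors into a candidate text and checks whether the score drops accordingly. The error types are made up for illustration, and a simple surface-similarity function stands in for a learned metric.

```python
# Minimal sketch of a perturbation-style robustness check for a text
# generation metric: inject synthetic errors into a candidate and verify
# that the score drops by a commensurate amount. The error types and the
# stand-in surface metric below are illustrative, not the paper's setup.
import difflib
import random

def surface_metric(candidate: str, reference: str) -> float:
    """Stand-in similarity metric in [0, 1]; a learned metric would go here."""
    return difflib.SequenceMatcher(None, candidate.split(), reference.split()).ratio()

def drop_tokens(text: str, rate: float, seed: int = 0) -> str:
    """Synthetic error: randomly delete a fraction of the tokens."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [t for t in tokens if rng.random() > rate]
    return " ".join(kept) if kept else tokens[0]

def negate(text: str) -> str:
    """Synthetic error: flip the meaning by inserting a negation."""
    return text.replace(" is ", " is not ", 1)

if __name__ == "__main__":
    reference = "the translated function is equivalent to the source program"
    candidate = "the translated function is equivalent to the source program"
    clean = surface_metric(candidate, reference)
    for name, corrupted in [("token drop", drop_tokens(candidate, 0.3)),
                            ("negation", negate(candidate))]:
        score = surface_metric(corrupted, reference)
        # A robust metric should penalize the corrupted candidate noticeably;
        # a small or zero drop would indicate a blind spot.
        print(f"{name}: clean={clean:.2f} corrupted={score:.2f} drop={clean - score:.2f}")
```

In this toy run the negation edit barely moves the surface score even though it inverts the meaning, which is exactly the kind of blind spot such stress tests are designed to expose.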