Code to Comment Translation: A Comparative Study on Model Effectiveness
& Errors
- URL: http://arxiv.org/abs/2106.08415v1
- Date: Tue, 15 Jun 2021 20:13:14 GMT
- Title: Code to Comment Translation: A Comparative Study on Model Effectiveness
& Errors
- Authors: Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios
Anastasopoulos, Kevin Moran
- Abstract summary: Machine translation models are employed to "translate" code snippets into relevant natural language descriptions.
Most evaluations of such models are conducted using automatic reference-based metrics.
We compare three recently proposed source code summarization models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics.
Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy.
- Score: 19.653423881863834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated source code summarization is a popular software engineering
research topic wherein machine translation models are employed to "translate"
code snippets into relevant natural language descriptions. Most evaluations of
such models are conducted using automatic reference-based metrics. However,
given the relatively large semantic gap between programming languages and
natural language, we argue that this line of research would benefit from a
qualitative investigation into the various error modes of current
state-of-the-art models. Therefore, in this work, we perform both a
quantitative and qualitative comparison of three recently proposed source code
summarization models. In our quantitative evaluation, we compare the models
based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics,
and in our qualitative evaluation, we perform a manual open-coding of the most
common errors committed by the models when compared to ground truth captions.
Our investigation reveals new insights into the relationship between
metric-based performance and model prediction errors grounded in an empirically
derived error taxonomy that can be used to drive future research efforts
Related papers
- Multilingual Models for Check-Worthy Social Media Posts Detection [0.552480439325792]
The study includes a comprehensive analysis of different models, with a special focus on multilingual models.
The novelty of this work lies in the development of multi-label multilingual classification models that can simultaneously detect harmful posts and posts that contain verifiable factual claims in an efficient way.
arXiv Detail & Related papers (2024-08-13T08:55:28Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Exploring Automatic Evaluation Methods based on a Decoder-based LLM for
Text Generation [16.78350863261211]
This paper compares various methods, including tuning with encoder-based models and large language models under equal conditions.
Experimental results show that compared to the tuned encoder-based models, the tuned decoder-based models perform poorly.
It is also revealed that in-context learning of very large decoder-based models such as ChatGPT makes it difficult to identify fine-grained semantic differences.
arXiv Detail & Related papers (2023-10-17T06:53:00Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling.
Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training.
Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Understanding Factual Errors in Summarization: Errors, Summarizers,
Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z) - An Application of Pseudo-Log-Likelihoods to Natural Language Scoring [5.382454613390483]
A language model with relatively few parameters and training steps can outperform it on a recent large data set.
We produce some absolute state-of-the-art results for common sense reasoning in binary choice tasks.
We argue that robustness of the smaller model ought to be understood in terms of compositionality.
arXiv Detail & Related papers (2022-01-23T22:00:54Z) - Evaluation of HTR models without Ground Truth Material [2.4792948967354236]
evaluation of Handwritten Text Recognition models during their development is straightforward.
But the evaluation process becomes tricky as soon as we switch from development to application.
We show that lexicon-based evaluation can compete with lexicon-based methods.
arXiv Detail & Related papers (2022-01-17T01:26:09Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Comparative Study of Language Models on Cross-Domain Data with Model
Agnostic Explainability [0.0]
The study compares the state-of-the-art language models - BERT, ELECTRA and its derivatives which include RoBERTa, ALBERT and DistilBERT.
The experimental results establish new state-of-the-art for 2013 rating classification task and Financial Phrasebank sentiment detection task with 69% accuracy and 88.2% accuracy respectively.
arXiv Detail & Related papers (2020-09-09T04:31:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.