PyMT5: multi-mode translation of natural language and Python code with transformers
- URL: http://arxiv.org/abs/2010.03150v1
- Date: Wed, 7 Oct 2020 04:10:58 GMT
- Title: PyMT5: multi-mode translation of natural language and Python code with transformers
- Authors: Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan
- Abstract summary: PyMT5 is a Python method text-to-text transfer transformer.
It can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style.
- Score: 7.973871379728246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simultaneously modeling source code and natural language has many exciting
applications in automated software development and understanding. Pursuant to
achieving such technology, we introduce PyMT5, the Python method text-to-text
transfer transformer, which is trained to translate between all pairs of Python
method feature combinations: a single model that can both predict whole methods
from natural language documentation strings (docstrings) and summarize code
into docstrings of any common style. We present an analysis and modeling effort
of a large-scale parallel corpus of 26 million Python methods and 7.7 million
method-docstring pairs, demonstrating that for docstring and method generation,
PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which
were English pre-trained or randomly initialized. On the CodeSearchNet test
set, our best model predicts 92.1% syntactically correct method bodies,
achieves a BLEU score of 8.59 for method generation and 16.3 for docstring
generation (summarization), and achieves a ROUGE-L F-score of 24.8 for method
generation and 36.7 for docstring generation.
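
The multi-mode setup can be pictured with a generic T5-style sequence-to-sequence model from the Hugging Face transformers library: one model serves both directions, with a control prefix selecting the target. The sketch below is only an illustration of that idea, not the released PyMT5 weights; the Salesforce/codet5-base checkpoint and the prefix strings are stand-in assumptions.

```python
# Hedged sketch of multi-mode docstring <-> code translation with a generic
# T5-style seq2seq model. This is NOT PyMT5; the checkpoint and the control
# prefixes are illustrative stand-ins, and the base model is not trained on
# these exact prefixes, so output quality will be poor without fine-tuning.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Salesforce/codet5-base"  # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(source: str, target_mode: str, max_new_tokens: int = 128) -> str:
    """Prefix the input with the requested target mode and decode one sample."""
    inputs = tokenizer(f"{target_mode}: {source}", return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Docstring -> method ("method generation")
print(translate("Return the sum of squares of a list of numbers.", "generate python"))

# Method -> docstring ("docstring generation" / summarization)
code = "def sum_squares(xs):\n    return sum(x * x for x in xs)"
print(translate(code, "summarize python"))
```

In PyMT5 itself the conditioning covers all pairs of method feature combinations (signatures, bodies, docstrings, and docstring styles), which is what lets a single model handle both generation and summarization.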
Related papers
- Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts [51.49688654641581]
We propose a task- and model-agnostic approach called MultiPoT, which harnesses the strength and diversity of various programming languages.
Experimental results reveal that it significantly outperforms Python Self-Consistency.
In particular, MultiPoT achieves more than a 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
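
A hedged sketch of the idea as summarized above: sample a program of thoughts in each language, run it, and let the languages vote on the answer. The helper functions generate_program and run_program are hypothetical stand-ins for an LLM call and a sandboxed interpreter, not MultiPoT's released code; only the voting logic is concrete.

```python
# Sketch of multilingual program-of-thoughts with answer voting. The language
# list is illustrative; generate_program and run_program are hypothetical
# placeholders for an LLM call and a sandboxed runner.
from collections import Counter

LANGUAGES = ["python", "javascript", "java", "r", "cpp"]

def generate_program(question: str, language: str) -> str:
    raise NotImplementedError("prompt an LLM for a program in `language` here")

def run_program(source: str, language: str) -> str:
    raise NotImplementedError("execute the program in a sandbox and return stdout")

def multilingual_pot(question: str) -> str:
    answers = []
    for lang in LANGUAGES:
        try:
            answers.append(run_program(generate_program(question, lang), lang).strip())
        except Exception:
            continue  # a language that fails simply casts no vote
    if not answers:
        raise RuntimeError("no language produced a runnable program")
    return Counter(answers).most_common(1)[0][0]  # majority vote across languages
```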
arXiv Detail & Related papers (2024-02-16T13:48:06Z)
- Test-Time Training on Nearest Neighbors for Large Language Models [25.365366617508663]
We build a large-scale distributed index based on text embeddings of the Pile dataset.
For each test input, our system retrieves its neighbors and fine-tunes the model on their text.
Surprisingly, retrieving and training on as few as 20 neighbors drastically improves performance across more than 20 language modeling tasks.
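
The recipe can be sketched as: embed the test input, pull its nearest neighbors out of a text index, and take a few gradient steps on them before predicting. The miniature below is assumption-laden (a toy in-memory corpus, gpt2 and a generic sentence-embedding model as stand-ins, arbitrary hyperparameters); the paper itself builds a large distributed index over the Pile.

```python
# Hedged miniature of test-time training on retrieved nearest neighbors.
# Corpus, retriever, language model, k, and learning rate are all illustrative.
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in retriever
tokenizer = AutoTokenizer.from_pretrained("gpt2")    # stand-in language model
model = AutoModelForCausalLM.from_pretrained("gpt2")

corpus = [                                           # toy stand-in for the Pile index
    "def add(a, b):\n    return a + b",
    "Paris is the capital of France.",
    "The derivative of x**2 is 2*x.",
    "def mean(xs):\n    return sum(xs) / len(xs)",
]
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)

def test_time_train(test_text: str, k: int = 20, lr: float = 2e-5) -> None:
    """Retrieve up to k nearest corpus texts and take one gradient step on each."""
    query = embedder.encode([test_text], normalize_embeddings=True)[0]
    top = np.argsort(corpus_emb @ query)[::-1][:k]   # cosine-similarity ranking
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for i in top:
        batch = tokenizer(corpus[i], return_tensors="pt", truncation=True, max_length=512)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()                                     # now evaluate on test_text

test_time_train("def subtract(a, b):")
```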
arXiv Detail & Related papers (2023-05-29T08:03:28Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- GAP-Gen: Guided Automatic Python Code Generation [3.574838772430975]
We propose GAP-Gen, a Guided Automatic Python Code Generation method based on Python syntactic and semantic constraints.
GAP-Gen fine-tunes the transformer-based language models T5 and CodeT5 using Code-to-Docstring datasets.
Our experiments show that GAP-Gen achieves better results on the automatic Python code generation task than previous work.
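
As a rough illustration of the fine-tuning setup in the docstring-to-code direction, here is a minimal single-step sketch with a generic CodeT5 checkpoint. It omits GAP-Gen's syntactic and semantic guidance entirely; the checkpoint, learning rate, and toy pair are assumptions.

```python
# Hedged single-step fine-tuning sketch for docstring -> code with a generic
# seq2seq checkpoint; GAP-Gen's guidance signals are NOT implemented here.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

docstring = "Return the area of a circle with radius r."
code = "def area(r):\n    return 3.14159 * r * r"

inputs = tokenizer(docstring, return_tensors="pt", truncation=True, max_length=64)
labels = tokenizer(code, return_tensors="pt", truncation=True, max_length=256).input_ids

model.train()
loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"one-step loss: {loss.item():.3f}")
```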
arXiv Detail & Related papers (2022-01-19T06:32:47Z)
- Program Synthesis with Large Language Models [40.41120807053989]
We evaluate large language models for program synthesis in Python.
We find that synthesis performance scales log-linearly with model size.
We find that even our best models are generally unable to predict the output of a program given a specific input.
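
Because models are poor at predicting program outputs, synthesized code is usually judged by executing it against test cases rather than by asking the model what it would print. Below is a minimal, unsandboxed sketch of such an execution-based check; the function name and test format are illustrative, and a real harness would isolate and time-limit execution.

```python
# Toy execution-based check for synthesized Python: run the candidate in a
# scratch namespace and evaluate assert-style tests. No sandboxing or timeouts,
# so only use this shape of code on trusted inputs.
def passes_tests(candidate_source: str, test_cases: list[str]) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)   # define the candidate function(s)
        for test in test_cases:
            exec(test, namespace)           # each test is an assert statement
    except Exception:
        return False
    return True

candidate = "def reverse_words(s):\n    return ' '.join(s.split()[::-1])"
tests = [
    "assert reverse_words('hello world') == 'world hello'",
    "assert reverse_words('a') == 'a'",
]
print(passes_tests(candidate, tests))       # True only if every assertion holds
```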
arXiv Detail & Related papers (2021-08-16T03:57:30Z)
- Automatic Code Generation using Pre-Trained Language Models [0.0]
We propose an end-to-end machine learning model for code generation in the Python language built on top of pre-trained language models.
We demonstrate that a fine-tuned model can perform well in code generation tasks, achieving a BLEU score of 0.22, an improvement of 46% over a reasonable sequence-to-sequence baseline.
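
For context on the reported BLEU of 0.22, here is a hedged sketch of a corpus-level BLEU computation over generated code using sacrebleu. Papers differ in tokenization and in whether scores are reported on a 0-1 or 0-100 scale, so treat the output as indicative only.

```python
# Corpus-level BLEU between generated code and references with sacrebleu.
# The snippets are toy examples; code-BLEU variants and tokenization choices
# differ between papers.
import sacrebleu

hypotheses = [
    "def add(a, b):\n    return a + b",
    "def is_even(n):\n    return n % 2 == 0",
]
references = [
    "def add(x, y):\n    return x + y",
    "def is_even(num):\n    return num % 2 == 0",
]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f} (0-100 scale)")
```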
arXiv Detail & Related papers (2021-02-21T07:21:26Z)
- PyHealth: A Python Library for Health Predictive Models [53.848478115284195]
PyHealth is an open-source Python toolbox for developing various predictive models on healthcare data.
The data preprocessing module enables the transformation of complex healthcare datasets into machine learning friendly formats.
The predictive modeling module provides more than 30 machine learning models, including established ensemble trees and deep neural network-based approaches.
arXiv Detail & Related papers (2021-01-11T22:02:08Z)
- XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data.
This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
arXiv Detail & Related papers (2020-12-31T11:16:51Z)
- Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus [15.807197703827818]
We train LangID models on up to 1,629 languages with comparable quality on held-out test sets.
We find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages.
We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models.
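
A minimal sketch of the first class of techniques, a wordlist-based tunable-precision filter: keep a crawled document for a language only if enough of its tokens appear in that language's wordlist. The tiny wordlist and the 0.6 threshold below are illustrative assumptions; raising the threshold trades recall for precision.

```python
# Toy wordlist-based filter with a tunable precision/recall threshold.
import re

def wordlist_filter(text: str, wordlist: set[str], threshold: float = 0.6) -> bool:
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return False
    in_list = sum(1 for t in tokens if t in wordlist)
    return in_list / len(tokens) >= threshold

english_words = {"the", "of", "and", "to", "in", "is", "for", "this", "that", "with"}
print(wordlist_filter("This is the text of the page", english_words))   # True
print(wordlist_filter("Dies ist der Text der Seite", english_words))    # False
```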
arXiv Detail & Related papers (2020-10-27T19:29:17Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
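
The comparison can be reproduced in miniature with the Hugging Face tokenizers library, used here as a stand-in for the paper's own setup: train a BPE and a unigram tokenizer on the same toy corpus and compare how they segment probe words. The corpus, vocabulary size, and probe words are illustrative.

```python
# Hedged miniature of the BPE vs. unigram LM tokenization comparison using the
# Hugging Face `tokenizers` library (a stand-in, not the paper's code).
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, UnigramTrainer

corpus = [
    "unhappiness is the unhappiest state of happiness",
    "the happier the tokenizer the happier the model",
    "pretraining language models requires good subword units",
]

bpe_tok = Tokenizer(BPE(unk_token="[UNK]"))
bpe_tok.pre_tokenizer = Whitespace()
bpe_tok.train_from_iterator(corpus, trainer=BpeTrainer(vocab_size=120, special_tokens=["[UNK]"]))

uni_tok = Tokenizer(Unigram())
uni_tok.pre_tokenizer = Whitespace()
uni_tok.train_from_iterator(
    corpus, trainer=UnigramTrainer(vocab_size=120, unk_token="[UNK]", special_tokens=["[UNK]"])
)

for word in ["unhappiness", "pretraining"]:
    print(word, "| BPE:", bpe_tok.encode(word).tokens, "| Unigram:", uni_tok.encode(word).tokens)
```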
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.