Incorporating Domain Knowledge through Task Augmentation for Front-End
JavaScript Code Generation
- URL: http://arxiv.org/abs/2208.10091v2
- Date: Tue, 23 Aug 2022 01:45:33 GMT
- Title: Incorporating Domain Knowledge through Task Augmentation for Front-End
JavaScript Code Generation
- Authors: Sijie Shen, Xiang Zhu, Yihong Dong, Qizhi Guo, Yankun Zhen, Ge Li
- Abstract summary: In some domain-specific scenarios, building such a large paired corpus for code generation is difficult because there is no directly available pairing data.
We propose a task augmentation method that incorporates domain knowledge into code generation models through auxiliary tasks and a Subtoken-TranX model.
Our experimental results demonstrate that the subtoken-level TranX model outperforms the original TranX model and the Transformer model on our dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation aims to generate a code snippet automatically from natural
language descriptions. Generally, the mainstream code generation methods rely
on a large amount of paired training data, including both the natural language
description and the code. However, in some domain-specific scenarios, building
such a large paired corpus for code generation is difficult because there is no
directly available pairing data, and a lot of effort is required to manually
write the code descriptions to construct a high-quality training dataset. Due
to the limited training data, the generation model cannot be trained well
and is likely to overfit, making its performance unsatisfactory for
real-world use. To this end, in this paper, we propose a task augmentation
method that incorporates domain knowledge into code generation models through
auxiliary tasks, together with Subtoken-TranX, a model that extends the
original TranX model to support subtoken-level code generation. To verify our proposed
approach, we collect a real-world code generation dataset and conduct
experiments on it. Our experimental results demonstrate that the subtoken-level
TranX model outperforms the original TranX model and the Transformer model on
our dataset, and the exact match accuracy of Subtoken-TranX improves
significantly by 12.75% with the help of our task augmentation method. The
model performance on several code categories has satisfied the requirements for
application in industrial systems. Our proposed approach has been adopted by
Alibaba's BizCook platform. To the best of our knowledge, this is the first
domain code generation system adopted in industrial development environments.
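The abstract itself contains no code. As a rough illustration of what subtoken-level generation means for front-end JavaScript, the sketch below splits identifiers into subtokens before they reach the decoder; the splitting rules and the function name are our own assumptions, not the paper's implementation.

```python
import re

def split_subtokens(identifier: str) -> list:
    """Split a JavaScript identifier into subtokens by underscores,
    camelCase/PascalCase boundaries, and digit runs (illustrative
    rules only; Subtoken-TranX defines its own tokenization)."""
    subtokens = []
    for part in re.split(r"[_$]+", identifier):
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", part)
    return [s.lower() for s in subtokens if s]

print(split_subtokens("getUserInfoById"))      # ['get', 'user', 'info', 'by', 'id']
print(split_subtokens("HTTPRequest2Handler"))  # ['http', 'request', '2', 'handler']
```

Generating at this granularity lets a model compose identifiers it never saw in training out of frequent subtokens, which is the usual motivation for subtoken-level decoding.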
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
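The summary above names the goal (preserving feature-class correlation) but not the mechanism; the sketch below shows only the generic LLM-for-tabular-data recipe of serializing rows as text with the class label placed first, so that generated feature values are conditioned on it. Nothing here is specific to the paper's three improvements.

```python
def serialize_row(row: dict, label_col: str) -> str:
    """Serialize one tabular row as text, label first, so an LLM
    fine-tuned on such strings generates features conditioned on the
    class (generic recipe, not the paper's exact format)."""
    label = f"{label_col} is {row[label_col]}"
    feats = ", ".join(f"{k} is {v}" for k, v in row.items() if k != label_col)
    return f"{label}, {feats}."

row = {"age": 39, "education": "Bachelors", "income": ">50K"}
print(serialize_row(row, "income"))
# income is >50K, age is 39, education is Bachelors.
```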
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Better Language Models of Code through Self-Improvement [18.75015225501755]
We propose a simple data augmentation framework for pre-trained language models for code (PLMCs).
Our framework utilizes knowledge gained during the pre-training and fine-tuning stages to generate pseudo data, which is then used as training data for the next step.
The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks.
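As a minimal sketch of the pseudo-data loop described above: a fine-tuned model labels unlabeled code, and the resulting pairs are mixed into the next round of training. All names here are placeholders, not the framework's API.

```python
def build_pseudo_pairs(model, unlabeled_code):
    """Use the current model to label unlabeled code, producing
    pseudo (code, summary) training pairs (placeholder interfaces)."""
    return [(code, model(code)) for code in unlabeled_code]

def self_improve(model, labeled, unlabeled, fine_tune, rounds=2):
    """Alternate between generating pseudo data and retraining on
    real + pseudo pairs (the paper's exact schedule may differ)."""
    for _ in range(rounds):
        pseudo = build_pseudo_pairs(model, unlabeled)
        model = fine_tune(model, labeled + pseudo)
    return model
```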
arXiv Detail & Related papers (2023-04-02T10:59:19Z)
- Generation-Augmented Query Expansion For Code Retrieval [51.20943646688115]
We propose a generation-augmented query expansion framework, inspired by the human retrieval process of sketching an answer before searching.
We achieve new state-of-the-art results on the CodeSearchNet benchmark.
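A minimal sketch of the "sketch an answer before searching" idea, assuming hypothetical generator and retriever interfaces:

```python
def expand_query(nl_query: str, generator, retriever, k: int = 10):
    """Generate a draft code snippet from the query, then retrieve
    with the query augmented by the draft (hypothetical interfaces)."""
    draft = generator(nl_query)        # seq2seq code generator, assumed
    expanded = f"{nl_query} {draft}"   # append the generated sketch
    return retriever(expanded, top_k=k)
```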
arXiv Detail & Related papers (2022-12-20T23:49:37Z)
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
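The summary names a multi-stage fine-tuning strategy without detail; the sketch below shows one common staging (large noisy pairs first, small curated pairs second) as an assumption, not the paper's exact schedule.

```python
def multi_stage_finetune(model, fine_tune, noisy_pairs, curated_pairs):
    """Two-stage recipe (assumed): adapt broadly on large noisy
    (code, docstring) data, then refine on a small human-vetted set."""
    model = fine_tune(model, noisy_pairs, epochs=1)    # broad adaptation
    model = fine_tune(model, curated_pairs, epochs=5)  # quality refinement
    return model
```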
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
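A minimal sketch of the symbolic, ordered target the summary describes: walking the relation matrix in a fixed entity order yields a deterministic triple sequence for the generator to emit. Entity and relation names are invented for illustration.

```python
def linearize_relation_matrix(entities, matrix):
    """Turn an entity-pair relation matrix into a deterministically
    ordered list of (head, relation, tail) triples."""
    triples = []
    for i, head in enumerate(entities):
        for j, tail in enumerate(entities):
            if i != j and matrix[i][j] is not None:
                triples.append((head, matrix[i][j], tail))
    return triples

ents = ["Alice", "Acme"]
mat = [[None, "works_for"],
       [None, None]]
print(linearize_relation_matrix(ents, mat))  # [('Alice', 'works_for', 'Acme')]
```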
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
- NatGen: Generative pre-training by "Naturalizing" source code [18.410818213965918]
We propose a new pre-training objective: the "naturalizing" of source code.
Unlike natural language, code has a bimodal, dual-channel nature that allows us to generate semantically equivalent code at scale.
We fine-tune our model in three generative Software Engineering tasks to achieve state-of-the-art performance rivaling CodeT5.
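As a toy illustration of the naturalizing objective, the sketch below applies one semantics-preserving rewrite and pairs the result with the original code; NatGen defines its own, richer set of transformation operators.

```python
import re

def denaturalize(code: str) -> str:
    """Toy semantics-preserving rewrite: expand `x += n` into
    `x = x + n` (one invented operator, not NatGen's actual set)."""
    return re.sub(r"(\w+)\s*\+=\s*(\w+)", r"\1 = \1 + \2", code)

natural = "count += 1"
unnatural = denaturalize(natural)     # 'count = count + 1'
training_pair = (unnatural, natural)  # model learns to restore the natural form
print(training_pair)
```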
arXiv Detail & Related papers (2022-06-15T15:08:29Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
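The summary does not spell out the loss; below is a standard in-batch contrastive (InfoNCE-style) loss between query and code embeddings, the usual core of contrastive code search. The paper's multimodal views and soft data augmentation are not shown.

```python
import numpy as np

def in_batch_contrastive_loss(query_vecs, code_vecs, temperature=0.05):
    """Each query's positive is the code at the same batch index;
    all other codes in the batch serve as negatives."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    logits = q @ c.T / temperature                      # cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # NLL of matched pairs

q = np.random.randn(4, 8)
c = q + 0.1 * np.random.randn(4, 8)                     # noisy positives
print(in_batch_contrastive_loss(q, c))
```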
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
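A simplified sketch of the Dynamic Blocking idea: when the decoder has just emitted a token that occurs in the source, the token that follows it in the source is blocked at the next step, discouraging verbatim copying. The original algorithm blocks probabilistically with sampled block dictionaries; this deterministic version is an illustration only.

```python
def dynamic_blocking_mask(source_tokens, generated_tokens):
    """Return the set of tokens to block at the next decoding step
    (deterministic simplification of Dynamic Blocking)."""
    if not generated_tokens:
        return set()
    last = generated_tokens[-1]
    return {source_tokens[i + 1]
            for i in range(len(source_tokens) - 1)
            if source_tokens[i] == last}

src = "the quick brown fox".split()
print(dynamic_blocking_mask(src, ["the"]))  # {'quick'}: the source bigram is blocked
```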
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
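A minimal sketch of the knowledge-grounded input side: knowledge triples are linearized into a single string for the encoder. The separator tokens and format here are placeholders, not KGPT's actual scheme.

```python
def linearize_triples(triples):
    """Linearize (subject, predicate, object) triples into one input
    string (placeholder separators, not KGPT's real format)."""
    return " ".join(f"[S] {s} [P] {p} [O] {o}" for s, p, o in triples)

kb = [("Alan Turing", "birthPlace", "London"),
      ("Alan Turing", "field", "computer science")]
print(linearize_triples(kb))
# [S] Alan Turing [P] birthPlace [O] London [S] Alan Turing [P] field [O] computer science
```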
arXiv Detail & Related papers (2020-10-05T19:59:05Z)