Synthetic Datasets for Neural Program Synthesis
- URL: http://arxiv.org/abs/1912.12345v1
- Date: Fri, 27 Dec 2019 21:28:10 GMT
- Title: Synthetic Datasets for Neural Program Synthesis
- Authors: Richard Shin, Neel Kant, Kavi Gupta, Christopher Bender, Brandon
Trabucco, Rishabh Singh, Dawn Song
- Abstract summary: We propose a new methodology for controlling and evaluating the bias of synthetic data distributions over both programs and specifications.
We demonstrate, using the Karel DSL and a small Calculator DSL, that training deep networks on these distributions leads to improved cross-distribution generalization performance.
- Score: 66.20924952964117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of program synthesis is to automatically generate programs in a
particular language from corresponding specifications, e.g. input-output
behavior. Many current approaches achieve impressive results after training on
randomly generated I/O examples in limited domain-specific languages (DSLs), as
with string transformations in RobustFill. However, we empirically discover
that applying test input generation techniques for languages with control flow
and rich input space causes deep networks to generalize poorly to certain data
distributions; to correct this, we propose a new methodology for controlling
and evaluating the bias of synthetic data distributions over both programs and
specifications. We demonstrate, using the Karel DSL and a small Calculator DSL,
that training deep networks on these distributions leads to improved
cross-distribution generalization performance.
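To make the setup concrete, here is a minimal sketch of what randomly generated (program, specification) pairs might look like for a small calculator-style DSL: programs are arithmetic expressions over one input variable, and specifications are input-output examples obtained by executing the sampled program. All names, the DSL grammar, and the sampling parameters below are illustrative assumptions, not the paper's actual generator; the paper's contribution is precisely about controlling the bias that such naive uniform sampling introduces.

```python
import random

# Operators of a hypothetical tiny Calculator DSL.
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def random_expr(depth, rng):
    """Sample a random expression tree; leaves are the input 'x' or a digit."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else str(rng.randint(0, 9))
    op = rng.choice(sorted(OPS))
    return (op, random_expr(depth - 1, rng), random_expr(depth - 1, rng))

def evaluate(expr, x):
    """Execute an expression tree on the input value x."""
    if expr == "x":
        return x
    if isinstance(expr, str):
        return int(expr)
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def sample_pair(rng, n_examples=3):
    """One synthetic training instance: a program plus its I/O specification."""
    prog = random_expr(depth=2, rng=rng)
    inputs = [rng.randint(-5, 5) for _ in range(n_examples)]
    spec = [(i, evaluate(prog, i)) for i in inputs]
    return prog, spec

rng = random.Random(0)
program, spec = sample_pair(rng)
print(program)  # an expression tree, e.g. ('+', 'x', '3')
print(spec)     # I/O pairs consistent with that program
```

The implicit distribution over programs here (expression shapes, constant ranges, input ranges) is exactly the kind of sampling bias the abstract argues must be measured and controlled, since a model trained on one such distribution may generalize poorly to another.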
Related papers
- The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation [34.37154877681809]
We introduce VeriDistill, the first end-to-end machine learning model that directly processes raw Verilog code to predict circuit quality-of-result metrics.
Our model employs a novel knowledge distillation method, transferring low-level circuit insights via graphs into an LLM-based predictor.
Experiments show VeriDistill outperforms state-of-the-art baselines on large-scale Verilog datasets.
arXiv Detail & Related papers (2024-10-30T04:20:10Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Enhancing Network Management Using Code Generated by Large Language Models [15.557254786007325]
We introduce a novel approach to facilitate a natural-language-based network management experience, utilizing large language models (LLMs) to generate task-specific code from natural language queries.
This method tackles the challenges of explainability, scalability, and privacy by allowing network operators to inspect the generated code.
We design and evaluate a prototype system using benchmark applications, showcasing high accuracy, cost-effectiveness, and the potential for further enhancements.
arXiv Detail & Related papers (2023-08-11T17:49:15Z)
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
- Hybridization of Capsule and LSTM Networks for unsupervised anomaly detection on multivariate data [0.0]
This paper introduces a novel neural network architecture which hybridises Long Short-Term Memory (LSTM) and Capsule Networks into a single network.
The proposed method uses an unsupervised learning technique to overcome the difficulty of finding large volumes of labelled training data.
arXiv Detail & Related papers (2022-02-11T10:33:53Z) - Latent Execution for Neural Program Synthesis Beyond Domain-Specific
Languages [97.58968222942173]
We take the first step to synthesize C programs from input-output examples.
In particular, we propose LaSynth, which learns a latent representation to approximate the execution of partially generated programs.
We show that training on these synthesized programs further improves the prediction performance for both Karel and C program synthesis.
arXiv Detail & Related papers (2021-06-29T02:21:32Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- PLANS: Robust Program Learning from Neurally Inferred Specifications [0.0]
Rule-based approaches offer correctness guarantees in an unsupervised way, while neural models are more realistically scalable to raw, high-dimensional input.
We introduce PLANS, a hybrid model for program synthesis from visual observations.
We obtain state-of-the-art performance at program synthesis from diverse demonstration videos in the Karel and ViZDoom environments.
arXiv Detail & Related papers (2020-06-05T08:51:34Z)
- Creating Synthetic Datasets via Evolution for Neural Program Synthesis [77.34726150561087]
We show that some program synthesis approaches generalize poorly to data distributions different from that of the randomly generated examples.
We propose a new, adversarial approach to control the bias of synthetic data distributions and show that it outperforms current approaches.
arXiv Detail & Related papers (2020-03-23T18:34:15Z)
- Controlled time series generation for automotive software-in-the-loop testing using GANs [0.5352699766206808]
Testing automotive mechatronic systems partly uses the software-in-the-loop approach, where systematically covering inputs of the system-under-test remains a major challenge.
One approach is to craft input sequences, which eases control and feedback of the test process but falls short of exposing the system to realistic scenarios.
The other is to replay sequences recorded from field operations, which reflects reality but requires collecting a well-labelled dataset of sufficient size for widespread use, which is expensive.
This work applies the well-known unsupervised learning framework of Generative Adversarial Networks (GANs) to learn an unlabeled dataset of recorded in-vehicle
arXiv Detail & Related papers (2020-02-16T16:19:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all information) and is not responsible for any consequences.