Automatic Generation of Python Programs Using Context-Free Grammars
- URL: http://arxiv.org/abs/2403.06503v1
- Date: Mon, 11 Mar 2024 08:25:52 GMT
- Title: Automatic Generation of Python Programs Using Context-Free Grammars
- Authors: Kamel Yamani, Marwa Na\"ir, Riyadh Baghdadi
- Abstract summary: TinyPy Generator is a tool that generates random Python programs using a context-free grammar.
Our system uses custom production rules to generate code with different levels of complexity.
TinyPy Generator is useful in the field of machine learning, where it can generate substantial amounts of Python code for training Python language models.
- Score: 0.1227734309612871
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, data has emerged as the new gold, serving as a powerful tool
for creating intelligent systems. However, procuring high-quality data remains
challenging, especially for code. To address this, we developed TinyPy
Generator, a tool that generates random Python programs using a context-free
grammar. The generated programs are guaranteed to be correct by construction.
Our system uses custom production rules (in the Backus-Naur Form (BNF) format)
to recursively generate code. This allows us to generate code with different
levels of complexity, ranging from code containing only assignments to more
complex code containing conditionals and loops. Our proposed tool enables
effortless large-scale Python code generation, beneficial for a wide range of
applications. TinyPy Generator is particularly useful in the field of machine
learning, where it can generate substantial amounts of Python code for training
Python language models. Additionally, researchers who are studying programming
languages can utilize this tool to create datasets for their experiments, which
can help validate the robustness of code interpreters or compilers. Unlike
existing research, we have open-sourced our implementation. This allows
customization according to user needs and extends potential usage to other
languages.
Related papers
- How Maintainable is Proficient Code? A Case Study of Three PyPI Libraries [3.0105723746073]
We investigate the risk level of proficient code inside a file.
We identify several instances of high proficient code that was also high risk.
We envision that the study should help developers identify scenarios where proficient code might be detrimental to future code maintenance activities.
arXiv Detail & Related papers (2024-10-08T04:45:11Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - A Comprehensive Guide to Combining R and Python code for Data Science, Machine Learning and Reinforcement Learning [42.350737545269105]
We show how to run Python's scikit-learn, pytorch and OpenAI gym libraries for building Machine Learning, Deep Learning, and Reinforcement Learning projects easily.
arXiv Detail & Related papers (2024-07-19T23:01:48Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language.
Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning [84.12154024070024]
We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks.
Our approach prompts a language model to generate full Python programs that define functions over data structures which contain natural language representations of structured knowledge.
A Python interpreter then executes the generated code and prints the output.
arXiv Detail & Related papers (2023-09-19T17:54:21Z) - Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z) - GAP-Gen: Guided Automatic Python Code Generation [3.574838772430975]
We propose a Guided Automatic Python Code Generation method based on Python syntactic constraints and semantic constraints.
GAP-Gen fine-tunes the transformer based language models T5 and CodeT5 using the Code-to-Docstring datasets.
Our experiments show that GAP-Gen achieves better results on automatic Python code generation task than previous works.
arXiv Detail & Related papers (2022-01-19T06:32:47Z) - Lyra: A Benchmark for Turducken-Style Code Generation [15.810088578588028]
In software development, one programming language is often embedded in another.
This paper defines a new code generation task: given a natural language comment, this task aims to generate a program in a base language with an embedded language.
To our knowledge, this is the first turducken-style code generation task.
arXiv Detail & Related papers (2021-08-27T07:22:55Z) - Natural Language-guided Programming [1.3955252961896318]
We put forward a vision based on a new breed of developer tools that have the potential to largely automate this process.
Key idea is to adapt code autocompletion tools such that they take into account not only the developer's already-written code but also the intent of the task the developer is trying to achieve next.
We call this practice of enriching the code with natural language intent to facilitate its completion natural language-guided programming.
arXiv Detail & Related papers (2021-08-11T13:06:33Z) - Incorporating External Knowledge through Pre-training for Natural
Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.