PromptSet: A Programmer's Prompting Dataset
- URL: http://arxiv.org/abs/2402.16932v1
- Date: Mon, 26 Feb 2024 16:34:29 GMT
- Title: PromptSet: A Programmer's Prompting Dataset
- Authors: Kaiser Pister, Dhruba Jyoti Paul, Patrick Brophy, Ishan Joshi
- Abstract summary: We present a novel dataset called PromptSet, with more than 61,000 unique developer prompts used in open source Python programs.
We perform analysis on this dataset and introduce the notion of a static linter for prompts.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of capabilities expressed by large language models has been quickly
followed by the integration of the same complex systems into application-level
logic. Algorithms, programs, systems, and companies are built around structured
prompting to black-box models, where the majority of the design and
implementation lies in capturing and quantifying the "agent mode". The standard
way to shape a closed language model is to prime it for a specific task with a
tailored prompt, often initially handwritten by a human. The textual prompts
co-evolve with the codebase, taking shape over the course of project life as
artifacts which must be reviewed and maintained, just as the traditional code
files might be. Unlike traditional code, we find that prompts do not receive
effective static testing and linting to prevent runtime issues. In this work,
we present a novel dataset called PromptSet, with more than 61,000 unique
developer prompts used in open source Python programs. We perform analysis on
this dataset and introduce the notion of a static linter for prompts. Released
with this publication is a HuggingFace dataset and a Github repository to
recreate collection and processing efforts, both under the name
pisterlabs/promptset.
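As a rough illustration of how the released artifacts might be used, the sketch below loads the HuggingFace dataset and applies a few toy static checks of the kind a prompt linter could perform. The split name and the "prompts" column are assumptions, not confirmed by the abstract; consult the dataset card for the actual schema.

```python
# Minimal sketch: load PromptSet and run toy "prompt lint" checks.
# Assumptions (not confirmed by the paper): a "train" split exists and each
# row has a "prompts" column holding a list of prompt strings; check the
# dataset card for pisterlabs/promptset before relying on this schema.
from datasets import load_dataset

ds = load_dataset("pisterlabs/promptset", split="train")

def lint_prompt(prompt: str) -> list[str]:
    """Toy static checks in the spirit of a prompt linter."""
    issues = []
    if prompt.count("{") != prompt.count("}"):
        issues.append("unbalanced template braces")
    if not prompt.strip():
        issues.append("empty prompt")
    if len(prompt) > 8000:
        issues.append("prompt may exceed a typical context budget")
    return issues

for row in ds.select(range(5)):   # inspect a handful of rows
    for p in row["prompts"]:      # assumed column name
        print(lint_prompt(p) or "ok", "|", p[:60])
```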
Related papers
- Statically Contextualizing Large Language Models with Typed Holes [4.180458188910334]
Large language models (LLMs) have reshaped the landscape of program synthesis.
LLMs often hallucinate broken code because they lack appropriate context.
This paper demonstrates that tight integration with the type and binding structure of a language can address this contextualization problem.
arXiv Detail & Related papers (2024-09-02T03:29:00Z)
- Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z)
- Python Code Generation by Asking Clarification Questions [57.63906360576212]
In this work, we introduce a novel and more realistic setup for this task.
We hypothesize that the under-specification of a natural language description can be resolved by asking clarification questions.
We collect and introduce a new dataset named CodeClarQA containing pairs of natural language descriptions and code with created synthetic clarification questions and answers.
arXiv Detail & Related papers (2022-12-19T22:08:36Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
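A minimal sketch of that two-step flow, with complete() as a hypothetical stand-in for any text-completion call; the prompt wording is illustrative, not the paper's exact templates.

```python
# Sketch of the generate-then-read (GenRead) flow described above.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def genread(question: str, n_docs: int = 3) -> str:
    # Step 1: generate contextual documents instead of retrieving them.
    docs = [complete(f"Generate a background document to answer: {question}")
            for _ in range(n_docs)]
    # Step 2: read the generated documents to produce the final answer.
    context = "\n\n".join(docs)
    return complete(f"Refer to the passages below and answer the question.\n\n"
                    f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:")
```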
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Repository-Level Prompt Generation for Large Language Models of Code [28.98699307030983]
We propose a framework that learns to generate example-specific prompts using prompt proposals.
The prompt proposals take context from the entire repository.
We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives.
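The sketch below illustrates the underlying idea of pulling context from across a repository into a completion prompt. It is a naive stand-in, not the paper's learned prompt-proposal mechanism, and all names and the truncation budget are hypothetical.

```python
from pathlib import Path

# Naive sketch: gather context from sibling files in the repository and
# prepend it to the prefix of the file being completed.
def build_prompt(repo: Path, current_file: Path, prefix: str,
                 budget: int = 4000) -> str:
    snippets = []
    for path in sorted(repo.rglob("*.py")):
        if path != current_file:
            snippets.append(f"# {path.name}\n{path.read_text()[:500]}")
    repo_context = "\n\n".join(snippets)[:budget]   # crude context budget
    return f"{repo_context}\n\n# {current_file.name}\n{prefix}"
```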
arXiv Detail & Related papers (2022-06-26T10:51:25Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
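A generic sketch of that idea, pairing files across languages by TF-IDF cosine similarity; the tokenization and the toy corpus are illustrative, not the paper's pipeline.

```python
# Pair code files across languages by document similarity: each Python file
# is matched to its most similar Java file, yielding a noisy parallel pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

python_files = ["def add(a, b): return a + b",
                "def greet(name): print(name)"]
java_files = ["int add(int a, int b) { return a + b; }",
              "void greet(String name) { System.out.println(name); }"]

vec = TfidfVectorizer(token_pattern=r"\w+")
matrix = vec.fit_transform(python_files + java_files)
sims = cosine_similarity(matrix[:len(python_files)], matrix[len(python_files):])

for i, row in enumerate(sims):
    print(python_files[i], "<->", java_files[row.argmax()])
```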
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [78.8500633981247]
This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning".
Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly.
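A toy example of that direct-text-probability view: wrap the input in a cloze template and let the LM score candidate fillers. lm_logprob() is a hypothetical stand-in for a pretrained LM scorer, and the template and verbalizers are made up.

```python
# Cloze-style classification in the prompt-based learning style: the label
# whose verbalizer yields the most probable filled-in text wins.
def lm_logprob(text: str) -> float:
    raise NotImplementedError("plug in a pretrained LM scorer here")

VERBALIZERS = {"positive": "great", "negative": "terrible"}

def classify(review: str) -> str:
    template = review + " Overall, it was a {} movie."
    scores = {label: lm_logprob(template.format(word))
              for label, word in VERBALIZERS.items()}
    return max(scores, key=scores.get)   # highest-probability filler wins
```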
arXiv Detail & Related papers (2021-07-28T18:09:46Z)
- Learning How to Ask: Querying LMs with Mixtures of Soft Prompts [33.43689407735244]
Natural-language prompts have recently been used to coax pretrained language models into performing other AI tasks.
We explore the idea of learning prompts by gradient descent.
For each task, we optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them.
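A heavily simplified sketch of that setup, assuming a frozen linear scorer as a stand-in for the pretrained LM: only the soft prompt embeddings and the mixture weights receive gradients.

```python
import torch

# Toy mixture-of-soft-prompts: all shapes and the frozen scorer are
# illustrative assumptions, not the paper's architecture.
emb_dim, prompt_len, n_prompts = 32, 5, 4

prompts = torch.nn.Parameter(torch.randn(n_prompts, prompt_len, emb_dim))
mix_logits = torch.nn.Parameter(torch.zeros(n_prompts))   # mixture weights

frozen_lm = torch.nn.Linear(emb_dim, 1)   # stand-in for the pretrained LM
for param in frozen_lm.parameters():
    param.requires_grad_(False)

opt = torch.optim.Adam([prompts, mix_logits], lr=1e-2)
x = torch.randn(emb_dim)       # fake embedding of a task input
target = torch.tensor(1.0)     # fake task label

for step in range(200):
    weights = torch.softmax(mix_logits, dim=0)
    # Score the input under each soft prompt (toy: average prompt + input).
    scores = torch.stack(
        [frozen_lm((pr.mean(dim=0) + x) / 2).squeeze() for pr in prompts]
    )
    pred = (weights * scores).sum()        # ensemble over prompts
    loss = (pred - target).pow(2)
    opt.zero_grad()
    loss.backward()
    opt.step()
```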
arXiv Detail & Related papers (2021-04-14T02:56:14Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
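A toy rendering of that progressive-insertion loop, with propose() as a hypothetical stand-in for POINTER's pretrained insertion model; the trajectory in the final comment is made up.

```python
from typing import Optional

def propose(left: str, right: str) -> Optional[str]:
    """Hypothetical stand-in for the insertion model: return a token to
    insert between `left` and `right`, or None to leave the gap empty."""
    raise NotImplementedError("plug in a pretrained insertion model here")

def refine(tokens: list[str], max_rounds: int = 5) -> list[str]:
    for _ in range(max_rounds):
        # Propose an insertion for every gap "in parallel" (here: a loop).
        gaps = [propose(l, r) for l, r in zip(tokens, tokens[1:])]
        if not any(g is not None for g in gaps):
            break                      # converged: no gap wants a token
        out = []
        for tok, ins in zip(tokens, gaps + [None]):
            out.append(tok)
            if ins is not None:
                out.append(ins)
        tokens = out
    return tokens

# Illustrative coarse-to-fine trajectory (made up):
# ["amazing", "movie"] -> ["amazing", "new", "movie"] -> ...
```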
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.