Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer
- URL: http://arxiv.org/abs/2208.11523v1
- Date: Wed, 24 Aug 2022 13:10:48 GMT
- Title: Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer
- Authors: Fengji Zhang, Jin Liu, Yao Wan, Xiao Yu, Xiao Liu, Jacky Keung
- Abstract summary: We propose M$_3$NSCT5, a novel approach to automatically generate multiple post titles from the given code snippets.
M$_3$NSCT5 employs the CodeT5 backbone, a pre-trained Transformer model with excellent language understanding.
We build a large-scale dataset with 890,000 question posts covering eight programming languages to validate the effectiveness of M$_3$NSCT5.
- Score: 11.03785369838242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stack Overflow is one of the most popular programming communities, where developers can seek help for the problems they encounter. Nevertheless, if inexperienced developers fail to describe their problems clearly, it is hard for them to attract sufficient attention and get the anticipated answers. We propose M$_3$NSCT5, a novel approach to automatically generate multiple post titles from a given code snippet. Developers may use the generated titles to find closely related posts and complete their problem descriptions. M$_3$NSCT5 employs the CodeT5 backbone, a pre-trained Transformer model with excellent language understanding and generation ability. To alleviate the ambiguity issue, whereby the same code snippet may be aligned with different titles under varying contexts, we propose the maximal marginal multiple nucleus sampling strategy, which generates multiple high-quality and diverse title candidates at a time for developers to choose from. We build a large-scale dataset with 890,000 question posts covering eight programming languages to validate the effectiveness of M$_3$NSCT5. Automatic evaluation results on the BLEU and ROUGE metrics demonstrate the superiority of M$_3$NSCT5 over six state-of-the-art baseline models. Moreover, a human evaluation with trustworthy results further demonstrates the great potential of our approach for real-world application.
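The abstract names the two ingredients of the sampling strategy: nucleus (top-p) sampling to over-generate candidate titles, and a maximal-marginal-relevance-style selection to keep a high-quality yet diverse subset. The sketch below shows one plausible reading of that combination on top of the public Salesforce/codet5-base checkpoint; the hyperparameters, the Jaccard similarity, and the relevance proxy are illustrative assumptions, not the paper's exact choices.

```python
# Hedged sketch: nucleus-sample many candidate titles with CodeT5, then keep a
# diverse subset via greedy maximal-marginal-relevance (MMR) selection.
# top_p, candidate counts, and lam are illustrative, not the paper's values.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

def sample_titles(code_snippet: str, num_candidates: int = 20) -> list[str]:
    """Over-generate candidate titles with nucleus (top-p) sampling."""
    inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        do_sample=True,               # nucleus sampling instead of beam search
        top_p=0.95,
        max_new_tokens=32,
        num_return_sequences=num_candidates,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def similarity(a: str, b: str) -> float:
    """Cheap Jaccard token overlap; a stand-in for whatever similarity
    measure the selection step actually uses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def mmr_select(candidates: list[str], k: int = 5, lam: float = 0.7) -> list[str]:
    """Greedy MMR: each step keeps the candidate maximizing
    lam * relevance - (1 - lam) * max similarity to already-selected titles.
    Relevance is approximated by sampling order here; the model's sequence
    log-probability would be a more faithful choice."""
    pool = list(dict.fromkeys(candidates))            # drop exact duplicates
    relevance = {c: 1.0 - i / len(pool) for i, c in enumerate(pool)}
    selected = [pool[0]]
    while len(selected) < min(k, len(pool)):
        best = max(
            (c for c in pool if c not in selected),
            key=lambda c: lam * relevance[c]
            - (1 - lam) * max(similarity(c, s) for s in selected),
        )
        selected.append(best)
    return selected

titles = mmr_select(sample_titles("def flatten(xs): return [x for l in xs for x in l]"))
```

Greedy selection keeps the diversity step at O(k·n) over n candidates, which is negligible next to the cost of sampling itself.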
Related papers
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Uncovering Weaknesses in Neural Code Generation [21.552898575210534]
We assess the quality of generated code using match-based and execution-based metrics, then conduct thematic analysis to develop a taxonomy of nine types of weaknesses.
In the CoNaLa dataset, inaccurate prompts are a notable problem, causing all large models to fail in 26.84% of cases.
Missing pivotal semantics is a pervasive issue across benchmarks, with one or more large models omitting key semantics in 65.78% of CoNaLa tasks.
All models struggle with proper API usage, a challenge amplified by vague or complex prompts.
arXiv Detail & Related papers (2024-07-13T07:31:43Z)
- Good things come in three: Generating SO Post Titles with Pre-Trained Models, Self Improvement and Post Ranking [5.874782446136913]
Stack Overflow is a prominent Q&A forum that supports developers in seeking suitable resources on programming-related matters.
Having high-quality question titles is an effective means to attract developers' attention.
Prior research has predominantly leveraged pre-trained models to generate titles from code snippets and problem descriptions.
We present FILLER as a solution to generating Stack Overflow post titles using a fine-tuned language model with self-improvement and post ranking.
arXiv Detail & Related papers (2024-06-21T20:18:34Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
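The retrieval-augmented setup this benchmark evaluates follows a standard pattern: retrieve context from one or more sources, prepend it to the task, and generate. A minimal sketch of that pattern, with a BM25 retriever from the rank_bm25 package and a toy corpus standing in for the benchmark's actual retrievers and sources:

```python
# Hedged sketch of retrieval-augmented code generation: retrieve the top-k
# documents for a task, prepend them to the prompt, and hand the prompt to a
# code LLM. Corpus, retriever, and template are illustrative stand-ins, not
# CodeRAG-Bench's actual configuration.
from rank_bm25 import BM25Okapi

corpus = [
    "itertools.chain.from_iterable(iterables): flatten one level of nesting.",
    "functools.reduce(function, iterable): apply function cumulatively.",
    "collections.Counter(iterable): count hashable items.",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def build_rag_prompt(task: str, k: int = 2) -> str:
    """Retrieve top-k documents and prepend them to the task description."""
    docs = bm25.get_top_n(task.split(), corpus, n=k)
    context = "\n".join(f"# Reference: {d}" for d in docs)
    return f"{context}\n# Task: {task}\n# Solution:\n"

prompt = build_rag_prompt("flatten a nested list of lists")
# `prompt` would then be passed to any code LLM's completion endpoint.
```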
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Validating LLM-Generated Programs with Metamorphic Prompt Testing [8.785973653167112]
Large Language Models (LLMs) are increasingly integrated into the software development lifecycle.
This paper proposes a novel solution called metamorphic prompt testing to address these challenges.
Our evaluation on HumanEval shows that metamorphic prompt testing detects 75% of the erroneous programs generated by GPT-4, with a false positive rate of 8.6%.
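The summary names the technique but not its mechanics. One common reading of metamorphic testing applied to prompts, sketched under that assumption below, is to generate programs from semantically equivalent prompt variants and flag the result as suspect when the programs disagree; the paraphrase and generate_program callables are hypothetical placeholders for LLM calls.

```python
# Hedged sketch of metamorphic prompt testing as commonly understood: generate
# programs from equivalent prompt variants and flag inconsistency between them.
# `paraphrase` and `generate_program` are hypothetical stand-ins for LLM calls.
from typing import Callable

def is_suspect(
    prompt: str,
    paraphrase: Callable[[str], list[str]],
    generate_program: Callable[[str], Callable],
    test_inputs: list,
) -> bool:
    """Return True if programs generated from equivalent prompts disagree."""
    programs = [generate_program(p) for p in [prompt, *paraphrase(prompt)]]
    for x in test_inputs:
        outputs = set()
        for program in programs:
            try:
                outputs.add(repr(program(x)))
            except Exception:
                outputs.add("<error>")
        if len(outputs) > 1:       # disagreement => likely erroneous program
            return True
    return False
```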
arXiv Detail & Related papers (2024-06-11T00:40:17Z)
- Automatic Bi-modal Question Title Generation for Stack Overflow with Prompt Learning [10.76882347665857]
An initial study aimed to generate titles automatically by analyzing only the code snippets in the question body.
We propose SOTitle+, an approach that considers bi-modal information (i.e., the code snippets and the problem descriptions) in the question body.
Our corpus includes 179,119 high-quality question posts for six popular programming languages.
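The entry's central idea is feeding both modalities of a post's body to the title generator. A minimal sketch of how bi-modal input might be serialized into a single prompted sequence for a T5-style model; the prompt wording, modality tags, and the untuned t5-base checkpoint are assumptions, not the exact SOTitle+ setup.

```python
# Hedged sketch: serialize the two modalities of a question body (problem
# description + code snippet) into one prompted input for a T5-style model.
# The template is an illustrative assumption, and t5-base is untuned here;
# the paper fine-tunes on Stack Overflow data.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def generate_title(description: str, code: str) -> str:
    source = f"generate a question title: description: {description} code: {code}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    output = model.generate(**inputs, max_new_tokens=32, num_beams=4)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(generate_title(
    "My list comprehension raises a NameError in Python 3.",
    "result = [x for x in data if x > threshold]",
))
```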
arXiv Detail & Related papers (2024-03-06T12:58:25Z)
- PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models [10.491051578439722]
We propose the idea of programming problem merging (PPM) and provide two implementations of this idea; we apply our tool to two widely used datasets.
The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems.
arXiv Detail & Related papers (2024-01-28T02:27:38Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- Recursion of Thought: A Divide-and-Conquer Approach to Multi-Context Reasoning with Language Models [58.41943058963672]
We propose a new inference framework called Recursion of Thought (RoT).
RoT introduces several special tokens that the models can output to trigger context-related operations.
Experiments with multiple architectures, including GPT-3, show that RoT dramatically improves LMs' problem-solving capability.
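The summary gives the core mechanism: special output tokens trigger operations on the context so a problem can be split into subproblems solved in fresh contexts. A skeletal sketch under assumed delegation tokens [SUB]...[/SUB] and a hypothetical model_call function; RoT's actual token set and training procedure differ in detail.

```python
# Hedged sketch of divide-and-conquer inference with special control tokens.
# The [SUB]...[/SUB] token names and `model_call` are illustrative assumptions.
import re
from typing import Callable

def solve(problem: str, model_call: Callable[[str], str], depth: int = 0) -> str:
    """Recursively solve: whenever the model delegates a subproblem by
    emitting [SUB]...[/SUB], solve it in a fresh context and substitute the
    sub-answer back into the surrounding output."""
    if depth > 10:                      # guard against runaway recursion
        raise RecursionError("maximum delegation depth exceeded")
    answer = model_call(problem)
    while (m := re.search(r"\[SUB\](.*?)\[/SUB\]", answer)) is not None:
        sub_answer = solve(m.group(1).strip(), model_call, depth + 1)
        answer = answer[:m.start()] + sub_answer + answer[m.end():]
    return answer
```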
arXiv Detail & Related papers (2023-06-12T06:34:16Z)
- Coder Reviewer Reranking for Code Generation [56.80381384717]
We propose Coder-Reviewer reranking as a method for sampling diverse programs from a code language model and reranking with model likelihood.
Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement over reranking with the Coder model only.
Coder-Reviewer reranking is easy to implement by prompting, generalizes to different programming languages, and works well with off-the-shelf hyperparameters.
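The summary describes the method concretely enough to sketch: sample many programs from a "Coder" model, then rerank by combining the Coder likelihood p(program | prompt) with a "Reviewer" likelihood p(prompt | program). The logp function below is a hypothetical wrapper over whatever LM API exposes token log-probabilities.

```python
# Hedged sketch of Coder-Reviewer reranking: score each sampled program by
# log p(program | prompt) + log p(prompt | program). `logp(context, target)`
# is a hypothetical wrapper returning the total log-probability of `target`
# as a continuation of `context` under a language model.
from typing import Callable

def coder_reviewer_rerank(
    prompt: str,
    programs: list[str],
    logp: Callable[[str, str], float],
) -> list[str]:
    """Sort sampled programs, best first, by the combined likelihood."""
    def score(program: str) -> float:
        coder = logp(prompt, program)      # p(program | prompt)
        reviewer = logp(program, prompt)   # p(prompt | program)
        return coder + reviewer
    return sorted(set(programs), key=score, reverse=True)
```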
arXiv Detail & Related papers (2022-11-29T18:56:33Z)
- $\textit{latent}$-GLAT: Glancing at Latent Variables for Parallel Text Generation [65.29170569821093]
Parallel text generation has received widespread attention due to its success in improving generation efficiency.
In this paper, we propose $\textit{latent}$-GLAT, which employs discrete latent variables to capture word categorical information.
Experimental results show that our method outperforms strong baselines without the help of an autoregressive model.
arXiv Detail & Related papers (2022-04-05T07:34:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.