Automatic Bi-modal Question Title Generation for Stack Overflow with
Prompt Learning
- URL: http://arxiv.org/abs/2403.03677v1
- Date: Wed, 6 Mar 2024 12:58:25 GMT
- Authors: Shaoyu Yang, Xiang Chen, Ke Liu, Guang Yang, Chi Yu
- Abstract summary: An initial study aimed to automatically generate the titles by only analyzing the code snippets in the question body.
We propose an approach SOTitle+ by considering bi-modal information (i.e., the code snippets and the problem descriptions) in the question body.
Our corpus includes 179,119 high-quality question posts for six popular programming languages.
- Score: 10.76882347665857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When drafting question posts for Stack Overflow, developers may not
accurately summarize the core problems in the question titles, which can cause
these questions to not get timely help. Therefore, improving the quality of
question titles has attracted the wide attention of researchers. An initial
study aimed to automatically generate the titles by only analyzing the code
snippets in the question body. However, this study ignored the helpful
information in their corresponding problem descriptions. Therefore, we propose
an approach SOTitle+ by considering bi-modal information (i.e., the code
snippets and the problem descriptions) in the question body. Then we formalize
the title generation for different programming languages as separate but
related tasks and utilize multi-task learning to solve these tasks. Later we
fine-tune the pre-trained language model CodeT5 to automatically generate the
titles. Unfortunately, the inconsistent inputs and optimization objectives
between the pre-training task and our investigated task may make fine-tuning
hard to fully explore the knowledge of the pre-trained model. To solve this
issue, SOTitle+ further prompt-tunes CodeT5 with hybrid prompts (i.e., mixture
of hard and soft prompts). To verify the effectiveness of SOTitle+, we
construct a large-scale high-quality corpus from recent data dumps shared by
Stack Overflow. Our corpus includes 179,119 high-quality question posts for six
popular programming languages. Experimental results show that SOTitle+ can
significantly outperform four state-of-the-art baselines in both automatic
evaluation and human evaluation. Our work indicates that considering bi-modal
information and prompt learning in Stack Overflow title generation is a
promising exploration direction.
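The abstract describes feeding both modalities (problem description and code snippet) to the model through a hybrid prompt that mixes hard (textual) and soft (trainable) parts. The following is a minimal sketch of how such an input might be assembled, not the authors' implementation. The template wording, the `<soft>` placeholder, the `Post` class, and the `build_hybrid_prompt` helper are all hypothetical names introduced here for illustration. In an actual prompt-tuning setup each soft token would map to a trainable embedding rather than a plain string.

```python
from dataclasses import dataclass

# Hypothetical placeholder standing in for one trainable (soft) prompt vector.
SOFT_TOKEN = "<soft>"


@dataclass
class Post:
    language: str     # programming language tag of the question, e.g. "python"
    description: str  # problem description taken from the question body
    code: str         # code snippet taken from the question body


def build_hybrid_prompt(post: Post, n_soft: int = 3) -> str:
    """Join soft-prompt placeholders, hard-prompt text, and both modalities.

    The template text is an assumption made for this sketch only.
    """
    soft = " ".join([SOFT_TOKEN] * n_soft)
    return (
        f"{soft} The problem description: {post.description} "
        f"The code snippet: {post.code} "
        f"Generate the question title for {post.language}:"
    )


example = Post("python", "My loop never ends.", "while True: pass")
prompt = build_hybrid_prompt(example)
print(prompt)
```

Naming the language inside the prompt is one simple way to let a single model handle the per-language tasks jointly, in the spirit of the multi-task setup the abstract mentions. Whether SOTitle+ encodes the language this way is an assumption of this sketch.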
Related papers
- Good things come in three: Generating SO Post Titles with Pre-Trained Models, Self Improvement and Post Ranking [5.874782446136913]
Stack Overflow is a prominent Q&A forum, supporting developers in seeking suitable resources on programming-related matters.
Having high-quality question titles is an effective means to attract developers' attention.
Research has been conducted, predominantly leveraging pre-trained models to generate titles from code snippets and problem descriptions.
We present FILLER as a solution to generating Stack Overflow post titles using a fine-tuned language model with self-improvement and post ranking.
arXiv Detail & Related papers (2024-06-21T20:18:34Z)
- Answer ranking in Community Question Answering: a deep learning approach [0.0]
This work tries to advance the state of the art on answer ranking for community Question Answering by proceeding with a deep learning approach.
We created a large data set of questions and answers posted to the Stack Overflow website.
We leveraged the natural language processing capabilities of dense embeddings and LSTM networks to produce a prediction for the accepted answer attribute.
arXiv Detail & Related papers (2022-10-16T18:47:41Z)
- Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer [11.03785369838242]
We propose M$_3$NSCT5, a novel approach to automatically generate multiple post titles from the given code snippets.
M$_3$NSCT5 employs the CodeT5 backbone, a pre-trained Transformer model with excellent language understanding.
We build a large-scale dataset with 890,000 question posts covering eight programming languages to validate the effectiveness of M$_3$NSCT5.
arXiv Detail & Related papers (2022-08-24T13:10:48Z)
- Modern Question Answering Datasets and Benchmarks: A Survey [5.026863544662493]
Question Answering (QA) is one of the most important natural language processing (NLP) tasks.
It aims to use NLP technologies to generate a corresponding answer to a given question based on a massive unstructured corpus.
In this paper, we investigate influential QA datasets that have been released in the era of deep learning.
arXiv Detail & Related papers (2022-06-30T05:53:56Z)
- Attention-based model for predicting question relatedness on Stack Overflow [0.0]
We propose an Attention-based Sentence pair Interaction Model (ASIM) to predict the relatedness between questions on Stack Overflow automatically.
ASIM has made significant improvement over the baseline approaches in Precision, Recall, and Micro-F1 evaluation metrics.
Our model also performs well in the duplicate question detection task of Ask Ubuntu.
arXiv Detail & Related papers (2021-03-19T12:18:03Z)
- The Influence of Domain-Based Preprocessing on Subject-Specific Clustering [55.41644538483948]
The sudden change of moving the majority of teaching online at Universities has caused an increased amount of workload for academics.
One way to deal with this problem is to cluster these questions depending on their topic.
In this paper, we explore the realms of tagging data sets, focusing on identifying code excerpts and providing empirical results.
arXiv Detail & Related papers (2020-11-16T17:47:19Z)
- Few-Shot Complex Knowledge Base Question Answering via Meta Reinforcement Learning [55.08037694027792]
Complex question-answering (CQA) involves answering complex natural-language questions on a knowledge base (KB).
The conventional neural program induction (NPI) approach exhibits uneven performance when the questions have different types.
This paper proposes a meta-reinforcement learning approach to program induction in CQA to tackle the potential distributional bias in questions.
arXiv Detail & Related papers (2020-10-29T18:34:55Z)
- Retrieve, Program, Repeat: Complex Knowledge Base Question Answering via Alternate Meta-learning [56.771557756836906]
We present a novel method that automatically learns a retrieval model alternately with the programmer from weak supervision.
Our system leads to state-of-the-art performance on a large-scale task for complex question answering over knowledge bases.
arXiv Detail & Related papers (2020-10-29T18:28:16Z)
- Understanding Unnatural Questions Improves Reasoning over Text [54.235828149899625]
Complex question answering (CQA) over raw text is a challenging task.
Learning an effective CQA model requires large amounts of human-annotated data.
We address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions.
arXiv Detail & Related papers (2020-10-19T10:22:16Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- Semantic Graphs for Generating Deep Questions [98.5161888878238]
We propose a novel framework which first constructs a semantic-level graph for the input document and then encodes the semantic graph by introducing an attention-based GGNN (Att-GGNN).
On the HotpotQA deep-question centric dataset, our model greatly improves performance over questions requiring reasoning over multiple facts, leading to state-of-the-art performance.
arXiv Detail & Related papers (2020-04-27T10:52:52Z)