Self-Supervised Query Reformulation for Code Search
- URL: http://arxiv.org/abs/2307.00267v1
- Date: Sat, 1 Jul 2023 08:17:23 GMT
- Title: Self-Supervised Query Reformulation for Code Search
- Authors: Yuetian Mao, Chengcheng Wan, Yuze Jiang, Xiaodong Gu
- Abstract summary: We propose SSQR, a self-supervised query reformulation method that does not rely on any parallel query corpus.
Inspired by pre-trained models, SSQR treats query reformulation as a masked language modeling task.
- Score: 6.415583252034772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic query reformulation is a widely utilized technology for enriching
user requirements and enhancing the outcomes of code search. It can be
conceptualized as a machine translation task, wherein the objective is to
rephrase a given query into a more comprehensive alternative. While such
models show promising results, training them typically requires a large
parallel corpus of query pairs (i.e., an original query and its reformulated
counterpart), which online code search engines keep confidential and
unpublished. This restricts their practicality in software development
processes. In this paper, we propose
SSQR, a self-supervised query reformulation method that does not rely on any
parallel query corpus. Inspired by pre-trained models, SSQR treats query
reformulation as a masked language modeling task conducted on an extensive
unannotated corpus of queries. SSQR extends T5 (a sequence-to-sequence model
based on Transformer) with a new pre-training objective named corrupted query
completion (CQC), which randomly masks words within a complete query and trains
T5 to predict the masked content. Subsequently, for a given query to be
reformulated, SSQR identifies potential locations for expansion and leverages
the pre-trained T5 model to generate appropriate content to fill these gaps.
The selection of expansions is then based on the information gain associated
with each candidate. Evaluation results demonstrate that SSQR outperforms
unsupervised baselines significantly and achieves competitive performance
compared to supervised methods.
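To make the CQC objective concrete, below is a minimal Python sketch of corrupted query completion: words are randomly masked in T5's sentinel-token format, and a T5 model fills the gaps. This is a sketch under stated assumptions, not the authors' code: the masking rate and helper names are illustrative, and a stock t5-base checkpoint stands in for a CQC-pre-trained model.

```python
# Minimal sketch of corrupted query completion (CQC) in T5's
# sentinel-token format. The masking rate, helper names, and the stock
# t5-base checkpoint (standing in for a CQC-pre-trained model) are
# illustrative assumptions, not details taken from the paper.
import random
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def corrupt_query(query: str, mask_rate: float = 0.15):
    """Randomly mask words of a query, yielding a (source, target)
    pair for CQC-style pre-training."""
    source, target, k = [], [], 0
    for word in query.split():
        if random.random() < mask_rate and k < 100:  # T5 has 100 sentinels
            source.append(f"<extra_id_{k}>")         # gap shown to the model
            target.append(f"<extra_id_{k}> {word}")  # content to predict
            k += 1
        else:
            source.append(word)
    return " ".join(source), " ".join(target)

# At inference time, expansion inserts a sentinel at a candidate
# position and lets the pre-trained model propose content for the gap.
query = "read json file <extra_id_0> into dataframe"
inputs = tokenizer(query, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In SSQR proper, candidate expansions produced this way are then ranked by the information gain of each candidate; that scoring step is omitted from the sketch.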
Related papers
- Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs [51.33342412699939]
Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs.
Recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries.
We propose an effective Query Instruction Parsing Plugin (QIPP) that captures latent query patterns from code-like query instructions.
arXiv Detail & Related papers (2024-10-27T03:18:52Z)
- GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval [20.807374287510623]
We propose GenCRF: a Generative Clustering and Reformulation Framework to capture diverse intentions adaptively.
We show that GenCRF achieves state-of-the-art performance, surpassing previous state-of-the-art query reformulation methods by up to 12% on nDCG@10.
arXiv Detail & Related papers (2024-09-17T05:59:32Z)
- Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers [66.55612528039894]
AdaQR is a framework for training query rewriting models with limited rewrite annotations from seed datasets and no passage labels at all.
A novel approach is proposed to assess the retriever's preference for these candidates by the probability of answers conditioned on the conversational query.
arXiv Detail & Related papers (2024-06-16T16:09:05Z)
- Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA).
We propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on query complexity.
We validate our model on a set of open-domain QA datasets covering multiple query complexities, and show that it enhances the overall efficiency and accuracy of QA systems; a minimal routing sketch follows below.
arXiv Detail & Related papers (2024-03-21T13:52:30Z)
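As an illustration of the Adaptive-RAG routing idea, here is a minimal sketch that dispatches a query to one of three strategies based on a complexity label. The heuristic classifier and strategy stubs are assumptions for illustration only; the paper trains a smaller language model as the classifier.

```python
# Illustrative sketch of complexity-based routing in the spirit of
# Adaptive-RAG. The toy heuristic classifier and the strategy stubs
# below are assumptions; the real system trains a small LM classifier.
from typing import Callable

def classify_complexity(query: str) -> str:
    """Toy stand-in for the trained complexity classifier."""
    if len(query.split()) <= 5:
        return "simple"        # answerable by the LLM alone
    multi_hop_cues = ("compare", "both", "before", "after")
    if any(cue in query.lower() for cue in multi_hop_cues):
        return "complex"       # iterative, multi-step retrieval
    return "moderate"          # single-step retrieval suffices

def answer(query: str, strategies: dict[str, Callable[[str], str]]) -> str:
    return strategies[classify_complexity(query)](query)

strategies = {
    "simple":   lambda q: f"LLM-only answer to: {q}",
    "moderate": lambda q: f"single-step RAG answer to: {q}",
    "complex":  lambda q: f"multi-step RAG answer to: {q}",
}
print(answer("Who wrote Dune?", strategies))  # routed to the LLM-only path
```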
- ConvGQR: Generative Query Reformulation for Conversational Search [37.54018632257896]
ConvGQR is a new framework to reformulate conversational queries based on generative pre-trained language models.
We propose a knowledge infusion mechanism to optimize both query reformulation and retrieval.
arXiv Detail & Related papers (2023-05-25T01:45:06Z)
- Decoding a Neural Retriever's Latent Space for Query Suggestion [28.410064376447718]
We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph.
We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco.
On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion.
arXiv Detail & Related papers (2022-10-21T16:19:31Z)
- Query Expansion and Entity Weighting for Query Reformulation Retrieval in Voice Assistant Systems [6.590172620606211]
Voice assistants such as Alexa, Siri, and Google Assistant have become increasingly popular worldwide.
Linguistic variations, variability of speech patterns, ambient acoustic conditions, and other such factors are often correlated with the assistant misinterpreting the user's query.
Retrieval-based query reformulation (QR) systems are widely used to reformulate such misinterpreted user queries.
arXiv Detail & Related papers (2022-02-22T23:03:29Z)
- End-to-End Open Vocabulary Keyword Search [13.90172596423425]
We propose a model directly optimized for keyword search.
The proposed model outperforms similar end-to-end models on a task where the ratio of positive and negative trials is artificially balanced.
Using our system to rescore the outputs of an LVCSR-based keyword search system leads to significant improvements.
arXiv Detail & Related papers (2021-08-23T18:34:53Z)
- Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting [54.03356526990088]
We propose Sequence Span Rewriting (SSR) as a self-supervised sequence-to-sequence (seq2seq) pre-training objective.
SSR provides more fine-grained learning signals for text representations by supervising the model to rewrite imperfect spans to ground truth.
Our experiments with T5 models on various seq2seq tasks show that SSR can substantially improve seq2seq pre-training; a data-construction sketch follows below.
arXiv Detail & Related papers (2021-01-02T10:27:11Z)
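A minimal sketch of how SSR-style training pairs might be constructed, assuming a weaker model (t5-small here, an illustrative choice) fills masked spans and the main model is then trained to rewrite those imperfect fills back to the ground truth:

```python
# Sketch of Sequence Span Rewriting (SSR) data construction. The model
# choice (t5-small as the imperfect span filler), the bracket convention
# for marking spans, and the helper names are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
filler = T5ForConditionalGeneration.from_pretrained("t5-small")

def ssr_pair(masked_text: str, ground_truth_span: str):
    """Build one (source, target) SSR pair from a text in which one
    span is masked out as <extra_id_0>."""
    ids = tok(masked_text, return_tensors="pt")
    fill = tok.decode(filler.generate(**ids, max_new_tokens=8)[0],
                      skip_special_tokens=True)
    # Source marks the machine-filled (possibly imperfect) span; the
    # target is the original span the model must learn to recover.
    source = masked_text.replace("<extra_id_0>", f"[{fill}]")
    return source, ground_truth_span

src, tgt = ssr_pair("sort a list <extra_id_0> python", "of numbers in")
```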
- Session-Aware Query Auto-completion using Extreme Multi-label Ranking [61.753713147852125]
We take the novel approach of modeling session-aware query auto-completion as an eXtreme Multi-label Ranking (XMR) problem.
We adapt a popular XMR algorithm for this purpose by proposing several modifications to the key steps in the algorithm.
Our approach meets the stringent latency requirements for auto-complete systems while leveraging session information in making suggestions.
arXiv Detail & Related papers (2020-12-09T17:56:22Z)
- Query Resolution for Conversational Search with Limited Supervision [63.131221660019776]
We propose QuReTeC (Query Resolution by Term Classification), a neural query resolution model based on bidirectional transformers.
We show that QuReTeC outperforms state-of-the-art models, and furthermore, that our distant supervision method can be used to substantially reduce the amount of human-curated data required to train QuReTeC.
arXiv Detail & Related papers (2020-05-24T11:37:22Z)
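QuReTeC frames query resolution as binary term classification over the conversation history. The sketch below shows that framing with a stock BERT token classifier standing in for the trained model; the checkpoint choice and wiring are assumptions for illustration, not the authors' released code.

```python
# Sketch of QuReTeC-style query resolution as binary term classification:
# classify each conversation-history term as relevant or not, then append
# relevant terms to the current-turn query. The stock BERT checkpoint is
# untrained for this task and only illustrates the wiring.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def resolve(history: str, query: str) -> str:
    enc = tok(history, query, return_tensors="pt")
    with torch.no_grad():
        labels = model(**enc).logits.argmax(-1)[0]  # 1 = relevant term
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    in_history = enc["token_type_ids"][0] == 0      # first segment only
    extra = [t for t, lab, h in zip(tokens, labels, in_history)
             if h and lab == 1 and t not in tok.all_special_tokens]
    return query + " " + " ".join(extra)

print(resolve("tell me about the Eiffel Tower", "how tall is it"))
```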