Exploring the Viability of Synthetic Query Generation for Relevance Prediction
- URL: http://arxiv.org/abs/2305.11944v2
- Date: Fri, 16 Jun 2023 22:00:21 GMT
- Title: Exploring the Viability of Synthetic Query Generation for Relevance Prediction
- Authors: Aditi Chaudhary, Karthik Raman, Krishna Srinivasan, Kazuma Hashimoto,
Mike Bendersky, Marc Najork
- Abstract summary: We conduct a study into how QGen approaches can be leveraged for nuanced relevance prediction.
We identify new shortcomings of existing QGen approaches -- including their inability to distinguish between different grades of relevance.
We introduce label-conditioned QGen models which incorporate knowledge about the different relevance grades.
- Score: 18.77909480819682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Query-document relevance prediction is a critical problem in Information
Retrieval systems. This problem has increasingly been tackled using
(pretrained) transformer-based models which are finetuned using large
collections of labeled data. However, in specialized domains such as e-commerce
and healthcare, the viability of this approach is limited by the dearth of
large in-domain data. To address this paucity, recent methods leverage these
powerful models to generate high-quality task and domain-specific synthetic
data. Prior work has largely explored synthetic data generation or query
generation (QGen) for Question-Answering (QA) and binary (yes/no) relevance
prediction, where, for instance, the QGen models are given a document and
trained to generate a query relevant to that document. However, in many
problems, we have a more fine-grained notion of relevance than a simple yes/no
label. Thus, in this work, we conduct a detailed study into how QGen approaches
can be leveraged for nuanced relevance prediction. We demonstrate that --
contrary to claims from prior works -- current QGen approaches fall short of
the more conventional cross-domain transfer-learning approaches. Via empirical
studies spanning 3 public e-commerce benchmarks, we identify new shortcomings
of existing QGen approaches -- including their inability to distinguish between
different grades of relevance. To address this, we introduce label-conditioned
QGen models which incorporate knowledge about the different relevance grades. While
our experiments demonstrate that these modifications help improve performance
of QGen techniques, we also find that QGen approaches struggle to capture the
full nuance of the relevance label space and, as a result, the generated queries
are not faithful to the desired relevance label.
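To make the setup concrete, the sketch below shows one way a label-conditioned QGen model could be realized: a seq2seq generator whose input is the document prefixed with the desired relevance grade. This is a minimal sketch; the "t5-base" checkpoint, the three-grade label scheme, and prefix-based conditioning are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of label-conditioned query generation (illustrative only).
# Assumptions: a generic T5 checkpoint stands in for a finetuned QGen model,
# and a text prefix stands in for whatever conditioning scheme is used.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

GRADES = ["exact", "substitute", "irrelevant"]  # hypothetical 3-grade scheme


def generate_query(document: str, grade: str, max_len: int = 32) -> str:
    """Generate a synthetic query intended to bear `grade` relevance to `document`."""
    assert grade in GRADES
    # Condition the generator by prefixing the desired relevance grade.
    prompt = f"relevance: {grade} document: {document}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_len,
        do_sample=True,  # sampling encourages query diversity
        top_p=0.95,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


doc = "Wireless noise-cancelling over-ear headphones with 30-hour battery life."
for g in GRADES:
    print(g, "->", generate_query(doc, g))
```

In practice the generator would first be finetuned on (grade, document, query) triples so that the grade prefix actually steers generation; the resulting synthetic triples can then stand in for scarce in-domain labels when training a relevance classifier.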
Related papers
- RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation [42.82192656794179]
Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses.
This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in unseen scenarios.
Retrieval-Augmented Generation (RAG) addresses this by incorporating external, relevant documents into the response generation process.
arXiv Detail & Related papers (2024-03-31T08:58:54Z)
- It's All Relative! -- A Synthetic Query Generation Approach for Improving Zero-Shot Relevance Prediction [19.881193965130173]
Large language models (LLMs) have shown promise in their ability to generate synthetic query-document pairs by prompting with as few as 8 demonstrations.
We propose to reduce this burden by generating queries simultaneously for different labels (see the prompt sketch after this list).
arXiv Detail & Related papers (2023-11-14T06:16:49Z)
- QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation [67.27999343730224]
We introduce an iterative bootstrapping framework for QA data augmentation (named QASnowball).
QASnowball can iteratively generate large-scale high-quality QA data based on a seed set of supervised examples.
We conduct experiments in the high-resource English scenario and the medium-resource Chinese scenario, and the results show that the data generated by QASnowball can benefit QA models.
arXiv Detail & Related papers (2023-09-19T05:20:36Z)
- Event Extraction as Question Generation and Answering [72.04433206754489]
Recent work on Event Extraction has reframed the task as Question Answering (QA).
We propose QGA-EE, which enables a Question Generation (QG) model to generate questions that incorporate rich contextual information instead of using fixed templates.
Experiments show that QGA-EE outperforms all prior single-task-based models on the ACE05 English dataset.
arXiv Detail & Related papers (2023-07-10T01:46:15Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- EQG-RACE: Examination-Type Question Generation [21.17100754955864]
We propose an innovative Examination-type Question Generation approach (EQG-RACE) to generate exam-like questions based on a dataset extracted from RACE.
Two main strategies are employed in EQG-RACE for dealing with discrete answer information and reasoning over long contexts.
Experimental results show state-of-the-art performance for EQG-RACE, which is clearly superior to the baselines.
arXiv Detail & Related papers (2020-12-11T03:52:17Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of the data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
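As a companion to the "It's All Relative!" entry above, here is a small prompt-construction sketch of generating queries for all relevance labels of a document in a single LLM call, so that each label is grounded by contrast with the others. The demonstration format, label names, and the `complete()` stub are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative few-shot prompt that asks an LLM for one query per relevance
# grade at once. Labels, demonstrations, and the completion stub are made up.
GRADES = ["highly relevant", "partially relevant", "irrelevant"]

# A few (document -> per-grade queries) demonstrations; the abstract reports
# that as few as 8 suffice.
DEMOS = [
    {
        "document": "Stainless steel 12-cup drip coffee maker with timer.",
        "queries": {
            "highly relevant": "programmable 12 cup coffee maker",
            "partially relevant": "single serve espresso machine",
            "irrelevant": "cordless electric kettle",
        },
    },
]


def build_prompt(document: str) -> str:
    """Assemble a few-shot prompt requesting one query per relevance grade."""
    parts = []
    for demo in DEMOS:
        parts.append(f"Document: {demo['document']}")
        for grade in GRADES:
            parts.append(f"{grade.capitalize()} query: {demo['queries'][grade]}")
        parts.append("")  # blank line between demonstrations
    parts.append(f"Document: {document}")
    parts.append(f"{GRADES[0].capitalize()} query:")  # the LLM continues here
    return "\n".join(parts)


def complete(prompt: str) -> str:
    """Stub for an LLM completion call; wire in any provider's API here."""
    raise NotImplementedError


print(build_prompt("Waterproof hiking boots with ankle support."))
```

Generating all grades together is what distinguishes this from per-label generation: the model sees the same document under every label and must produce queries that differ only in relevance, which is exactly the contrast the main paper finds difficult to enforce.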