Improving Content Retrievability in Search with Controllable Query
Generation
- URL: http://arxiv.org/abs/2303.11648v1
- Date: Tue, 21 Mar 2023 07:46:57 GMT
- Title: Improving Content Retrievability in Search with Controllable Query
Generation
- Authors: Gustavo Penha, Enrico Palumbo, Maryam Aziz, Alice Wang and Hugues
Bouchard
- Abstract summary: Machine-learned search engines have a high retrievability bias, where the majority of the queries return the same entities.
- We propose CtrlQGen, a method that generates queries for a chosen underlying intent, either narrow or broad.
Our results on datasets from the domains of music, podcasts, and books reveal that we can significantly decrease the retrievability bias of a dense retrieval model.
- Score: 5.450798147045502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An important goal of online platforms is to enable content discovery, i.e.
allow users to find a catalog entity they were not familiar with. A
pre-requisite to discover an entity, e.g. a book, with a search engine is that
the entity is retrievable, i.e. there are queries for which the system will
surface such entity in the top results. However, machine-learned search engines
have a high retrievability bias, where the majority of the queries return the
same entities. This happens partly due to the predominance of narrow intent
queries, where users create queries using the title of an already known entity,
e.g. in book search 'harry potter'. The amount of broad queries where users
want to discover new entities, e.g. in music search 'chill lyrical electronica
with an atmospheric feeling to it', and have a higher tolerance to what they
might find, is small in comparison. We focus here on two factors that have a
negative impact on the retrievability of the entities: (I) the training data
used for dense retrieval models and (II) the distribution of narrow and broad
intent queries issued in the system. We propose CtrlQGen, a method that
generates queries for a chosen underlying intent, either narrow or broad. We can use
CtrlQGen to improve factor (I) by generating training data for dense retrieval
models comprised of diverse synthetic queries. CtrlQGen can also be used to
deal with factor (II) by suggesting queries with broader intents to users. Our
results on datasets from the domains of music, podcasts, and books reveal that
we can significantly decrease the retrievability bias of a dense retrieval
model when using CtrlQGen. First, by using the generated queries as training
data for dense models we make 9% of the entities retrievable (go from zero to
non-zero retrievability). Second, by suggesting broader queries to users, we
can make 12% of the entities retrievable in the best case.
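As an illustration of the retrievability notion defined above (an entity is retrievable if some query surfaces it in the top results), the following sketch counts how often each catalog entity appears in the top-k results over a query set and summarizes the exposure skew with a Gini coefficient. This is a minimal, hypothetical example: the `retrieve` callable stands in for any search system and is not part of the paper's method or code.

```python
from collections import Counter

def retrievability_counts(queries, retrieve, k=10):
    """Count how often each entity appears in the top-k results.

    `retrieve(query, k)` is an assumed interface standing in for any
    search system; it must return a ranked list of entity ids.
    """
    counts = Counter()
    for q in queries:
        for entity in retrieve(q, k):
            counts[entity] += 1
    return counts

def gini(values):
    """Gini coefficient of non-negative exposure counts.

    0 = perfectly even exposure across entities,
    values near 1 = exposure concentrated on few entities.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over the sorted cumulative distribution.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

# Toy example: a four-entity catalog and a degenerate "retriever"
# that always surfaces the same two entities, mimicking a high
# retrievability bias.
catalog = ["e1", "e2", "e3", "e4"]

def toy_retrieve(query, k):
    return ["e1", "e2"][:k]

counts = retrievability_counts(["q1", "q2", "q3"], toy_retrieve, k=2)
# Entities never retrieved get a zero count so the bias is visible.
exposure = [counts.get(e, 0) for e in catalog]
print(gini(exposure))
```

Under this toy retriever, half the catalog has zero retrievability; diversifying the training queries (factor I) or the issued queries (factor II) would flatten the exposure counts and drive the Gini coefficient toward zero.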
Related papers
- Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval [49.42043077545341]
We propose a knowledge-aware query expansion framework, augmenting LLMs with structured document relations from a knowledge graph (KG).
We leverage document texts as rich KG node representations and use document-based relation filtering for our Knowledge-Aware Retrieval (KAR).
arXiv Detail & Related papers (2024-10-17T17:03:23Z) - Improving Retrieval in Sponsored Search by Leveraging Query Context Signals [6.152499434499752]
We propose an approach to enhance query understanding by augmenting queries with rich contextual signals.
We use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations.
Our context-aware approach substantially outperforms context-free models.
arXiv Detail & Related papers (2024-07-19T14:28:53Z) - BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
Many complex real-world queries require in-depth reasoning to identify relevant documents.
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding.
arXiv Detail & Related papers (2024-07-16T17:58:27Z) - Database-Augmented Query Representation for Information Retrieval [59.57065228857247]
We present a novel retrieval framework called Database-Augmented Query representation (DAQu).
DAQu augments the original query with various (query-related) metadata across multiple tables.
We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database.
arXiv Detail & Related papers (2024-06-23T05:02:21Z) - Enhanced Facet Generation with LLM Editing [5.4327243200369555]
In information retrieval, facet identification of a user query is an important task.
Previous studies can enhance facet prediction by leveraging retrieved documents and related queries obtained through a search engine.
However, there are challenges in extending it to other applications when a search engine operates as part of the model.
arXiv Detail & Related papers (2024-03-25T00:43:44Z) - Corrective Retrieval Augmented Generation [36.04062963574603]
Retrieval-augmented generation (RAG) relies heavily on relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong.
We propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation.
CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches.
arXiv Detail & Related papers (2024-01-29T04:36:39Z) - Decoding a Neural Retriever's Latent Space for Query Suggestion [28.410064376447718]
We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph.
We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco.
On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion.
arXiv Detail & Related papers (2022-10-21T16:19:31Z) - Graph Enhanced BERT for Query Understanding [55.90334539898102]
Query understanding plays a key role in exploring users' search intents and facilitating users to locate their most desired information.
In recent years, pre-trained language models (PLMs) have advanced various natural language processing tasks.
We propose a novel graph-enhanced pre-training framework, GE-BERT, which can leverage both query content and the query graph.
arXiv Detail & Related papers (2022-04-03T16:50:30Z) - Improving Candidate Retrieval with Entity Profile Generation for
Wikidata Entity Linking [76.00737707718795]
We propose a novel candidate retrieval paradigm based on entity profiling.
We use the profile to query the indexed search engine to retrieve candidate entities.
Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary.
arXiv Detail & Related papers (2022-02-27T17:38:53Z) - Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries, as well as conducting an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z) - APRF-Net: Attentive Pseudo-Relevance Feedback Network for Query
Categorization [12.634704014206294]
We propose a novel deep neural model named Attentive Pseudo-Relevance Feedback Network (APRF-Net) to enhance the representation of rare queries for query categorization.
Our results show that the APRF-Net significantly improves query categorization by 5.9% on $F1@1$ score over the baselines, which increases to 8.2% improvement for the rare queries.
arXiv Detail & Related papers (2021-04-23T02:34:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.