AmazonQAC: A Large-Scale, Naturalistic Query Autocomplete Dataset
- URL: http://arxiv.org/abs/2411.04129v1
- Date: Tue, 22 Oct 2024 21:11:34 GMT
- Title: AmazonQAC: A Large-Scale, Naturalistic Query Autocomplete Dataset
- Authors: Dante Everaert, Rohit Patki, Tianqi Zheng, Christopher Potts,
- Abstract summary: We introduce AmazonQAC, a new QAC dataset sourced from Amazon Search logs, comprising 395M samples.
The dataset includes actual sequences of user-typed prefixes leading to final search terms, as well as session IDs and timestamps.
We assess Prefix Trees, semantic retrieval, and Large Language Models (LLMs) with and without finetuning.
- Score: 14.544120039123934
- License:
- Abstract: Query Autocomplete (QAC) is a critical feature in modern search engines, facilitating user interaction by predicting search queries based on input prefixes. Despite its widespread adoption, the absence of large-scale, realistic datasets has hindered advancements in QAC system development. This paper addresses this gap by introducing AmazonQAC, a new QAC dataset sourced from Amazon Search logs, comprising 395M samples. The dataset includes actual sequences of user-typed prefixes leading to final search terms, as well as session IDs and timestamps that support modeling the context-dependent aspects of QAC. We assess Prefix Trees, semantic retrieval, and Large Language Models (LLMs) with and without finetuning. We find that finetuned LLMs perform best, particularly when incorporating contextual information. However, even our best system achieves only half of what we calculate is theoretically possible on our test data, which implies QAC is a challenging problem that is far from solved with existing systems. This contribution aims to stimulate further research on QAC systems to better serve user needs in diverse environments. We open-source this data on Hugging Face at https://huggingface.co/datasets/amazon/AmazonQAC.
Related papers
- IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization [59.06663981902496]
Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization.
We investigate two indispensable characteristics that the LLMs-based QFS models should be harnessed, Lengthy Document Summarization and Efficiently Fine-grained Query-LLM Alignment.
These innovations pave the way for broader application and accessibility in the field of QFS technology.
arXiv Detail & Related papers (2024-07-15T07:14:56Z) - ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z) - DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs [3.24692739098077]
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning.
We evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting.
We observe that late interaction models and surprisingly lexical models like BM25 perform well compared to other pre-trained dense retrieval models.
arXiv Detail & Related papers (2024-06-24T22:09:50Z) - Seasonality Based Reranking of E-commerce Autocomplete Using Natural
Language Queries [15.37457156804212]
Query autocomplete (QAC) also known as typeahead, suggests list of complete queries as user types prefix in the search box.
One of the goals of typeahead is to suggest relevant queries to users which are seasonally important.
We propose a neural network based natural language processing (NLP) algorithm to incorporate seasonality as a signal.
arXiv Detail & Related papers (2023-08-03T21:14:25Z) - Improving Text Matching in E-Commerce Search with A Rationalizable,
Intervenable and Fast Entity-Based Relevance Model [78.80174696043021]
We propose a novel model called the Entity-Based Relevance Model (EBRM)
The decomposition allows us to use a Cross-encoder QE relevance module for high accuracy.
We also show that pretraining the QE module with auto-generated QE data from user logs can further improve the overall performance.
arXiv Detail & Related papers (2023-07-01T15:44:53Z) - Learning to Retrieve Engaging Follow-Up Queries [12.380514998172199]
We present a retrieval based system and associated dataset for predicting the next questions that the user might have.
Such a system can proactively assist users in knowledge exploration leading to a more engaging dialog.
arXiv Detail & Related papers (2023-02-21T20:26:23Z) - Mixed-modality Representation Learning and Pre-training for Joint
Table-and-Text Retrieval in OpenQA [85.17249272519626]
An optimized OpenQA Table-Text Retriever (OTTeR) is proposed.
We conduct retrieval-centric mixed-modality synthetic pre-training.
OTTeR substantially improves the performance of table-and-text retrieval on the OTT-QA dataset.
arXiv Detail & Related papers (2022-10-11T07:04:39Z) - Beyond Accuracy: A Consolidated Tool for Visual Question Answering
Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z) - Session-Aware Query Auto-completion using Extreme Multi-label Ranking [61.753713147852125]
We take the novel approach of modeling session-aware query auto-completion as an e Multi-Xtreme Ranking (XMR) problem.
We adapt a popular XMR algorithm for this purpose by proposing several modifications to the key steps in the algorithm.
Our approach meets the stringent latency requirements for auto-complete systems while leveraging session information in making suggestions.
arXiv Detail & Related papers (2020-12-09T17:56:22Z) - Efficient Neural Query Auto Completion [17.58784759652327]
Three major challenges are observed for a query auto completion system.
Traditional QAC systems rely on handcrafted features such as the query candidate frequency in search logs.
We propose an efficient neural QAC system with effective context modeling to overcome these challenges.
arXiv Detail & Related papers (2020-08-06T21:28:36Z) - Conversations with Search Engines: SERP-based Conversational Response
Generation [77.1381159789032]
We create a suitable dataset, the Search as a Conversation (SaaC) dataset, for the development of pipelines for conversations with search engines.
We also develop a state-of-the-art pipeline for conversations with search engines, the Conversations with Search Engines (CaSE) using this dataset.
CaSE enhances the state-of-the-art by introducing a supporting token identification module and aprior-aware pointer generator.
arXiv Detail & Related papers (2020-04-29T13:07:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.