Multiview Identifiers Enhanced Generative Retrieval
- URL: http://arxiv.org/abs/2305.16675v1
- Date: Fri, 26 May 2023 06:50:21 GMT
- Title: Multiview Identifiers Enhanced Generative Retrieval
- Authors: Yongqi Li, Nan Yang, Liang Wang, Furu Wei, Wenjie Li
- Abstract summary: generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
- Score: 78.38443356800848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instead of simply matching a query to pre-existing passages, generative
retrieval generates identifier strings of passages as the retrieval target. At
a cost, the identifier must be distinctive enough to represent a passage.
Current approaches use either a numeric ID or a text piece (such as a title or
substrings) as the identifier. However, these identifiers cannot cover a
passage's content well. As such, we are motivated to propose a new type of
identifier, synthetic identifiers, that are generated based on the content of a
passage and could integrate contextualized information that text pieces lack.
Furthermore, we simultaneously consider multiview identifiers, including
synthetic identifiers, titles, and substrings. These views of identifiers
complement each other and facilitate the holistic ranking of passages from
multiple perspectives. We conduct a series of experiments on three public
datasets, and the results indicate that our proposed approach performs the best
in generative retrieval, demonstrating its effectiveness and robustness.
Related papers
- Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$)
GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z) - Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner.
Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval.
We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance to a query of a document.
arXiv Detail & Related papers (2024-03-31T13:29:43Z) - Summarization-Based Document IDs for Generative Retrieval with Language Models [65.11811787587403]
We introduce summarization-based document IDs, in which each document's ID is composed of an extractive summary or abstractive keyphrases.
We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively.
We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO.
arXiv Detail & Related papers (2023-11-14T23:28:36Z) - Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z) - Character Queries: A Transformer-based Approach to On-Line Handwritten
Character Segmentation [4.128716153761773]
We focus on the scenario where the transcription is known beforehand, in which case the character segmentation becomes an assignment problem.
Inspired by the $k$-means clustering algorithm, we view it from the perspective of cluster assignment and present a Transformer-based architecture.
In order to assess the quality of our approach, we create character segmentation ground truths for two popular on-line handwriting datasets.
arXiv Detail & Related papers (2023-09-06T15:19:04Z) - Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task of Fill-in the Identity, that aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation as well an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z) - OCoR: An Overlapping-Aware Code Retriever [15.531119719750807]
Given a natural language description, code retrieval aims to search for the most relevant code among a set of code.
Existing state-of-the-art approaches apply neural networks to code retrieval.
We propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps.
arXiv Detail & Related papers (2020-08-12T09:43:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.