Multiview Identifiers Enhanced Generative Retrieval
- URL: http://arxiv.org/abs/2305.16675v1
- Date: Fri, 26 May 2023 06:50:21 GMT
- Title: Multiview Identifiers Enhanced Generative Retrieval
- Authors: Yongqi Li, Nan Yang, Liang Wang, Furu Wei, Wenjie Li
- Abstract summary: Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
- Score: 78.38443356800848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instead of simply matching a query to pre-existing passages, generative
retrieval generates identifier strings of passages as the retrieval target. At
a cost, the identifier must be distinctive enough to represent a passage.
Current approaches use either a numeric ID or a text piece (such as a title or
substrings) as the identifier. However, these identifiers cannot cover a
passage's content well. As such, we are motivated to propose a new type of
identifier, synthetic identifiers, that are generated based on the content of a
passage and could integrate contextualized information that text pieces lack.
Furthermore, we simultaneously consider multiview identifiers, including
synthetic identifiers, titles, and substrings. These views of identifiers
complement each other and facilitate the holistic ranking of passages from
multiple perspectives. We conduct a series of experiments on three public
datasets, and the results indicate that our proposed approach performs the best
in generative retrieval, demonstrating its effectiveness and robustness.
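To make the idea of ranking passages from multiple identifier views concrete, here is a minimal sketch. The abstract does not say how scores from the different views are combined, so the summation rule below is an assumption, not the authors' stated method. The function name `rank_passages`, the view names, the example passages, and the scores are all hypothetical. Exact set membership stands in for whatever matching a real system uses between generated identifiers and passages.

```python
from collections import defaultdict


def rank_passages(predicted, passage_identifiers):
    """Rank passages by summing the scores of the predicted identifiers they own.

    predicted: dict mapping a view name to a list of (identifier, score) pairs,
        standing in for what a generative model would emit for one query.
    passage_identifiers: dict mapping a passage id to a dict of
        view name -> set of identifiers attached to that passage.

    ASSUMPTION: scores are aggregated by plain summation across views.
    """
    totals = defaultdict(float)
    for passage_id, views in passage_identifiers.items():
        for view, predictions in predicted.items():
            owned = views.get(view, set())
            for identifier, score in predictions:
                if identifier in owned:
                    totals[passage_id] += score
    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical passages, each described by three views of identifiers.
passage_identifiers = {
    "p1": {
        "title": {"Eiffel Tower"},
        "substring": {"completed in 1889"},
        "synthetic": {"when was the eiffel tower built"},
    },
    "p2": {
        "title": {"Tokyo Tower"},
        "substring": {"completed in 1958"},
        "synthetic": {"how tall is tokyo tower"},
    },
}

# Hypothetical model output for the query "when was the eiffel tower finished".
predicted = {
    "title": [("Eiffel Tower", 0.9), ("Tokyo Tower", 0.4)],
    "substring": [("completed in 1889", 0.7)],
    "synthetic": [("when was the eiffel tower built", 0.8)],
}

ranking = rank_passages(predicted, passage_identifiers)
for passage_id, total in ranking:
    print(passage_id, round(total, 2))
```

In this toy setup a passage matched in all three views outranks one matched only by its title, which illustrates how the views can complement each other.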
Related papers
- Generative Retrieval as Multi-Vector Dense Retrieval [71.75503049199897]
Generative retrieval generates identifiers of relevant documents in an end-to-end manner.
Prior work has demonstrated that generative retrieval with atomic identifiers is equivalent to single-vector dense retrieval.
We show that generative retrieval and multi-vector dense retrieval share the same framework for measuring the relevance of a document to a query.
arXiv Detail & Related papers (2024-03-31T13:29:43Z)
- Summarization-Based Document IDs for Generative Retrieval with Language Models [65.11811787587403]
We introduce summarization-based document IDs, in which each document's ID is composed of an extractive summary or abstractive keyphrases.
We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively.
We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO.
arXiv Detail & Related papers (2023-11-14T23:28:36Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation [4.128716153761773]
We focus on the scenario where the transcription is known beforehand, in which case the character segmentation becomes an assignment problem.
Inspired by the $k$-means clustering algorithm, we view it from the perspective of cluster assignment and present a Transformer-based architecture.
In order to assess the quality of our approach, we create character segmentation ground truths for two popular on-line handwriting datasets.
arXiv Detail & Related papers (2023-09-06T15:19:04Z)
- Unsupervised Text Deidentification [101.2219634341714]
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank.
arXiv Detail & Related papers (2022-10-20T18:54:39Z)
- Identity Documents Authentication based on Forgery Detection of Guilloche Pattern [2.606834301724095]
An authentication model for identity documents based on forgery detection of guilloche patterns is proposed.
Experiments are conducted in order to analyze and identify the most proper parameters to achieve higher authentication performance.
arXiv Detail & Related papers (2022-06-22T11:37:10Z)
- Identity-Aware Multi-Sentence Video Description [105.13845996039277]
We introduce an auxiliary task of Fill-in the Identity, that aims to predict persons' IDs consistently within a set of clips.
One of the key components is a gender-aware textual representation as well as an additional gender prediction objective in the main model.
Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works.
arXiv Detail & Related papers (2020-08-22T09:50:43Z)
- OCoR: An Overlapping-Aware Code Retriever [15.531119719750807]
Given a natural language description, code retrieval aims to search for the most relevant code among a set of code.
Existing state-of-the-art approaches apply neural networks to code retrieval.
We propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps.
arXiv Detail & Related papers (2020-08-12T09:43:35Z)