COIL: Revisit Exact Lexical Match in Information Retrieval with
Contextualized Inverted List
- URL: http://arxiv.org/abs/2104.07186v1
- Date: Thu, 15 Apr 2021 00:53:54 GMT
- Title: COIL: Revisit Exact Lexical Match in Information Retrieval with
Contextualized Inverted List
- Authors: Luyu Gao, Zhuyun Dai, Jamie Callan
- Abstract summary: COIL is a contextualized exact-match retrieval architecture that brings semantics to lexical matching.
COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers at similar or smaller latency.
- Score: 19.212507277554415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical information retrieval systems such as BM25 rely on exact lexical
match and carry out search efficiently with an inverted-list index. Recent neural
IR models have shifted towards soft semantic matching of all query-document terms, but
they lose the computational efficiency of exact-match systems. This paper
presents COIL, a contextualized exact-match retrieval architecture that brings
semantics to lexical matching. COIL scoring is based on the contextualized
representations of overlapping query-document tokens. The new architecture stores
contextualized token representations in inverted lists, bringing together the
efficiency of exact match and the representation power of deep language models.
Our experimental results show that COIL outperforms classical lexical retrievers and
state-of-the-art deep LM retrievers at similar or smaller latency.
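The scoring rule in the abstract can be sketched in a few lines: for each query token, take the best dot product against document tokens with the same surface form, and sum over query tokens. This is a minimal illustration, not the paper's implementation; the tokens and 2-d vectors below are hypothetical, and a real system would encode tokens with a deep LM (e.g. BERT) and store the document-side vectors in per-token inverted lists.

```python
import numpy as np

def coil_score(query_toks, doc_toks):
    """COIL-style score: sum, over query tokens, of the maximum dot
    product against document tokens with the SAME surface form
    (exact lexical match); non-overlapping tokens contribute 0."""
    score = 0.0
    for q_tok, q_vec in query_toks:
        matches = [float(np.dot(q_vec, d_vec))
                   for d_tok, d_vec in doc_toks if d_tok == q_tok]
        if matches:
            score += max(matches)
    return score

# Hypothetical contextualized vectors for illustration only.
query = [("bank", np.array([1.0, 0.0])), ("loan", np.array([0.0, 1.0]))]
doc   = [("bank", np.array([0.9, 0.1])), ("river", np.array([0.1, 0.9]))]
print(coil_score(query, doc))  # only "bank" overlaps; "loan" contributes 0
```

Because only exactly-matching tokens are compared, the document-side vectors can be grouped by surface form into inverted lists, so at query time only the lists for the query's tokens are touched, which is the source of the efficiency claim.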
Related papers
- Evaluating the impact of word embeddings on similarity scoring in practical information retrieval [0.5872014229110214]
Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing pipelines. This paper evaluates an alternative approach to measuring query statement similarity that moves away from the common similarity measure of centroids of neural word embeddings.
arXiv Detail & Related papers (2026-02-05T14:57:38Z)
- X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning [23.9465771255843]
This work proposes X-CoT, an explainable retrieval framework built upon LLM CoT reasoning. We first expand the existing benchmarks with additional video annotations to support semantic understanding. X-CoT empirically improves the retrieval performance and produces detailed rationales.
arXiv Detail & Related papers (2025-09-25T20:39:45Z)
- Chain of Retrieval: Multi-Aspect Iterative Search Expansion and Post-Order Search Aggregation for Full Paper Retrieval [68.71038700559195]
Chain of Retrieval (COR) is a novel iterative framework for full-paper retrieval. We present SCIBENCH, a benchmark providing both complete and segmented contexts of full papers for queries and candidates.
arXiv Detail & Related papers (2025-07-14T08:41:53Z)
- Enhancing Retrieval Systems with Inference-Time Logical Reasoning [9.526027847179677]
We propose a novel inference-time logical reasoning framework that explicitly incorporates logical reasoning into the retrieval process.
Our method extracts logical reasoning structures from natural language queries and then composes the individual cosine similarity scores to formulate the final document scores.
arXiv Detail & Related papers (2025-03-22T20:40:18Z)
- GeAR: Generation Augmented Retrieval [82.20696567697016]
Document retrieval techniques form the foundation for the development of large-scale information systems.
The prevailing methodology is to construct a bi-encoder and compute the semantic similarity.
We propose a new method, GeAR (Generation Augmented Retrieval), that incorporates well-designed fusion and decoding modules.
arXiv Detail & Related papers (2025-01-06T05:29:00Z)
- Aligning Query Representation with Rewritten Query and Relevance Judgments in Conversational Search [32.35446999027349]
We leverage both rewritten queries and relevance judgments in the conversational search data to train a better query representation model.
The proposed model -- Query Representation Alignment Conversational Retriever, QRACDR, is tested on eight datasets.
arXiv Detail & Related papers (2024-07-29T17:14:36Z)
- Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations [8.796275989527054]
We propose a novel organization of the inverted index that enables fast retrieval over learned sparse embeddings.
Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector.
Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions.
arXiv Detail & Related papers (2024-04-29T15:49:27Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves when used in standard nearest neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z)
- Precise Zero-Shot Dense Retrieval without Relevance Labels [60.457378374671656]
Hypothetical Document Embeddings (HyDE) is a zero-shot dense retrieval system.
We show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever.
arXiv Detail & Related papers (2022-12-20T18:09:52Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching [66.71886789848472]
We propose a novel hierarchical noise filtering model, namely Match-Ignition, to tackle the effectiveness and efficiency problem.
The basic idea is to plug the well-known PageRank algorithm into the Transformer, to identify and filter both sentence and word level noisy information.
Noisy sentences are usually easy to detect because the sentence is the basic unit of a long-form text, so we directly use PageRank to filter such information.
arXiv Detail & Related papers (2021-01-16T10:34:03Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.