Related papers: BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

URL: http://arxiv.org/abs/2407.12883v1
Date: Tue, 16 Jul 2024 17:58:27 GMT
Title: BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
Authors: Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan O. Arik, Danqi Chen, Tao Yu,
Abstract summary: Many complex real-world queries require in-depth reasoning to identify relevant documents. We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. brightbenchmark is constructed from the 1,398 real-world queries collected from diverse domains.
Score: 54.54576644403115
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing retrieval benchmarks primarily consist of information-seeking queries (e.g., aggregated questions from search engines) where keyword or semantic-based retrieval is usually sufficient. However, many complex real-world queries require in-depth reasoning to identify relevant documents that go beyond surface form matching. For example, finding documentation for a coding question requires understanding the logic and syntax of the functions involved. To better benchmark retrieval on such challenging queries, we introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. BRIGHT is constructed from the 1,398 real-world queries collected from diverse domains (such as economics, psychology, robotics, software engineering, earth sciences, etc.), sourced from naturally occurring or carefully curated human data. Extensive evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT. The leading model on the MTEB leaderboard [38 ], which achieves a score of 59.0 nDCG@10,2 produces a score of nDCG@10 of 18.0 on BRIGHT. We further demonstrate that augmenting queries with Chain-of-Thought reasoning generated by large language models (LLMs) improves performance by up to 12.2 points. Moreover, BRIGHT is robust against data leakage during pretraining of the benchmarked models as we validate by showing similar performance even when documents from the benchmark are included in the training data. We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings. Our code and data are available at https://brightbenchmark.github.io.

Related papers

ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge [49.65993318863458]
ImpliRet is a benchmark that shifts the reasoning challenge to document-side processing.<n>We evaluate a range of sparse and dense retrievers, all of which struggle in this setting.
arXiv Detail & Related papers (2025-06-17T11:08:29Z)
Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents [17.506934704019226]
standardized documents share similar formats such as repetitive boilerplate texts, and similar table structures.<n>This similarity forces traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness.<n>We propose the Hierarchical Retrieval with Evidence Curation framework to address these issues.
arXiv Detail & Related papers (2025-05-26T11:08:23Z)
Large Language Model Can Be a Foundation for Hidden Rationale-Based Retrieval [12.83513794686623]
In this paper, we propose and study a more challenging type of retrieval task, called hidden rationale retrieval. To address such problems, an instruction-tuned Large language model (LLM) with a cross-encoder architecture could be a reasonable choice. We name this retrieval framework by RaHoRe and verify its zero-shot and fine-tuned performance superiority on Emotional Support Conversation (ESC)
arXiv Detail & Related papers (2024-12-21T13:19:15Z)
Disentangling Questions from Query Generation for Task-Adaptive Retrieval [22.86406485412172]
We propose EGG, a query generator that better adapts to wide search intents expressed in the BeIR benchmark. Our method outperforms baselines and existing models on four tasks with underexplored intents, while utilizing a query generator 47 times smaller than the previous state-of-the-art.
arXiv Detail & Related papers (2024-09-25T02:53:27Z)
ExcluIR: Exclusionary Neural Information Retrieval [74.08276741093317]
We present ExcluIR, a set of resources for exclusionary retrieval. evaluation benchmark includes 3,452 high-quality exclusionary queries. training set contains 70,293 exclusionary queries, each paired with a positive document and a negative document.
arXiv Detail & Related papers (2024-04-26T09:43:40Z)
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Knowledge Bases. Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine. We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z)
Ask Optimal Questions: Aligning Large Language Models with Retriever's Preference in Conversational Search [25.16282868262589]
RetPO is designed to optimize a language model (LM) for reformulating search queries in line with the preferences of the target retrieval systems. We construct a large-scale dataset called Retrievers' Feedback on over 410K query rewrites across 12K conversations. The resulting model achieves state-of-the-art performance on two recent conversational search benchmarks.
arXiv Detail & Related papers (2024-02-19T04:41:31Z)
Decomposing Complex Queries for Tip-of-the-tongue Retrieval [72.07449449115167]
Complex queries describe content elements (e.g., book characters or events), information beyond the document text. This retrieval setting, called tip of the tongue (TOT), is especially challenging for models reliant on lexical and semantic overlap between query and document text. We introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results.
arXiv Detail & Related papers (2023-05-24T11:43:40Z)
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task emphDocument-Aware Passage Retrieval (DAPR) While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo- documents by few-shot prompting large language models (LLMs) query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets. Our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
Decoding a Neural Retriever's Latent Space for Query Suggestion [28.410064376447718]
We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph. We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco. On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion.
arXiv Detail & Related papers (2022-10-21T16:19:31Z)
Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback [29.719150565643965]
This paper proposes ANCE-PRF, a new query encoder that uses pseudo relevance feedback (PRF) to improve query representations for dense retrieval. ANCE-PRF uses a BERT encoder that consumes the query and the top retrieved documents from a dense retrieval model, ANCE, and it learns to produce better query embeddings directly from relevance labels. Analysis shows that the PRF encoder effectively captures the relevant and complementary information from PRF documents, while ignoring the noise with its learned attention mechanism.
arXiv Detail & Related papers (2021-08-30T18:10:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.