Related papers: Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM

URL: http://arxiv.org/abs/2509.02558v1
Date: Tue, 02 Sep 2025 17:53:57 GMT
Title: Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM
Authors: Yijun Ge, Sahel Sharifymoghaddam, Jimmy Lin,
Abstract summary: The BRIGHT benchmark is a dataset consisting of reasoning-intensive queries over diverse domains. We apply listwise reranking with large language models to further investigate the impact of reranking on reasoning-intensive queries. These baselines are integrated into popular retrieval and reranking toolkits Anserini, Pyserini, and RankLLM.
Score: 44.67715098747863
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The BRIGHT benchmark is a dataset consisting of reasoning-intensive queries over diverse domains. We explore retrieval results on BRIGHT using a range of retrieval techniques, including sparse, dense, and fusion methods, and establish reproducible baselines. We then apply listwise reranking with large language models (LLMs) to further investigate the impact of reranking on reasoning-intensive queries. These baselines are integrated into the popular retrieval and reranking toolkits Anserini, Pyserini, and RankLLM, with two-click reproducibility that makes them easy to build upon and convenient for further development. While attempting to reproduce the results reported in the original BRIGHT paper, we find that the provided BM25 scores differ notably from those that we obtain using Anserini and Pyserini. We discover that this difference is due to BRIGHT's implementation of BM25, which applies BM25 weighting to the query itself when constructing query vectors, rather than using the standard bag-of-words approach, as in Anserini. This difference has become increasingly relevant with the rise of longer queries, BRIGHT's lengthy reasoning-intensive queries being a prime example, and is further accentuated by the increasing use of retrieval-augmented generation, where LLM prompts can grow much longer than "traditional" search engine queries. Our observation suggests that it may be time to reconsider BM25 approaches going forward in order to better accommodate emerging applications. To facilitate this, we integrate query-side BM25 into both Anserini and Pyserini.
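To make the two ways of building a query vector concrete, here is a minimal sketch on a toy corpus. It assumes whitespace tokenization, a hypothetical three-document corpus, the classic BM25 term weight with Lucene-style idf, and the usual Anserini defaults k1 = 0.9, b = 0.4. The "query-side" function is an assumed reading of "applying BM25 on the query" (the query is weighted as if it were a document, using corpus statistics); it is not a reproduction of BRIGHT's, Anserini's, or Pyserini's actual code.

```python
import math
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization stands in for a real analyzer.
CORPUS = {
    "d1": "the cat sat on the mat",
    "d2": "dogs chase the cat",
    "d3": "a treatise on reasoning intensive retrieval",
}
K1, B = 0.9, 0.4  # usual Anserini BM25 defaults

docs = {docid: Counter(text.split()) for docid, text in CORPUS.items()}
doc_len = {docid: sum(tf.values()) for docid, tf in docs.items()}
N = len(docs)
AVGDL = sum(doc_len.values()) / N
df = Counter(term for tf in docs.values() for term in tf)


def idf(term: str) -> float:
    # Lucene-style BM25 idf; always positive, even for very common terms.
    n = df.get(term, 0)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_weight(term: str, tf: int, length: int) -> float:
    # Classic BM25 term weight: idf x saturated tf with length normalization.
    if tf == 0:
        return 0.0
    norm = K1 * (1 - B + B * length / AVGDL)
    return idf(term) * tf * (K1 + 1) / (tf + norm)


def score_bag_of_words(query: str, docid: str) -> float:
    # Bag-of-words query: each term contributes (query tf) x (document BM25 weight),
    # so a term repeated n times in the query counts n times.
    q = Counter(query.split())
    return sum(qtf * bm25_weight(t, docs[docid][t], doc_len[docid]) for t, qtf in q.items())


def score_query_side(query: str, docid: str) -> float:
    # Assumed query-side variant: the query vector is itself BM25-weighted
    # (treated as a document against corpus statistics), so repeats saturate.
    q = Counter(query.split())
    qlen = sum(q.values())
    return sum(
        bm25_weight(t, qtf, qlen) * bm25_weight(t, docs[docid][t], doc_len[docid])
        for t, qtf in q.items()
    )
```

Under the bag-of-words construction, `score_bag_of_words("cat cat cat", "d1")` is three times `score_bag_of_words("cat", "d1")`; under the query-side weighting the same repetition yields well under a threefold increase. That gap is one way the two approaches can diverge on long, verbose queries such as BRIGHT's.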

Related papers

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents [60.53966065867568]
We introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries.
arXiv Detail & Related papers (2026-02-05T18:25:24Z)
SpIDER: Spatially Informed Dense Embedding Retrieval for Software Issue Localization [6.098008057625392]
Agentic approaches typically employ sparse retrieval methods like BM25 or dense embedding strategies to identify relevant units. We propose SpIDER (Spatially Informed Dense Embedding Retrieval), an enhanced dense retrieval approach that incorporates LLM-based reasoning over auxiliary context. Empirical results show that SpIDER consistently improves dense retrieval performance across several programming languages.
arXiv Detail & Related papers (2025-12-18T01:32:25Z)
Revisiting Feedback Models for HyDE [49.53124785319461]
HyDE is a method that enriches query representations with LLM-generated hypothetical answer documents. Our experiments show that HyDE's effectiveness can be substantially improved when leveraging feedback algorithms such as Rocchio to extract and weight expansion terms.
arXiv Detail & Related papers (2025-11-24T17:50:18Z)
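The Rocchio idea mentioned in the HyDE summary can be sketched as follows: treat the LLM-generated hypothetical documents as feedback documents, average their normalized term vectors, mix that centroid with the original query vector, and keep the strongest new terms as weighted expansion terms. This is a generic Rocchio sketch under stated assumptions (whitespace tokenization, no stopword removal, L1-normalized term counts, illustrative alpha = 1.0 and beta = 0.75, hard-coded stand-ins for the generated documents), not the paper's exact configuration.

```python
from collections import Counter


def l1_normalize(vec: Counter) -> dict[str, float]:
    # Turn raw term counts into weights that sum to 1.
    total = sum(vec.values())
    return {t: w / total for t, w in vec.items()} if total else {}


def rocchio_expand(
    query: str,
    feedback_docs: list[str],
    alpha: float = 1.0,  # illustrative weight on the original query
    beta: float = 0.75,  # illustrative weight on the feedback centroid
    top_k: int = 5,      # number of new expansion terms to keep
) -> dict[str, float]:
    q_vec = l1_normalize(Counter(query.split()))
    # Centroid of the (hypothetical, LLM-generated) feedback documents.
    centroid: dict[str, float] = {}
    for doc in feedback_docs:
        for t, w in l1_normalize(Counter(doc.split())).items():
            centroid[t] = centroid.get(t, 0.0) + w / len(feedback_docs)
    # Rocchio combination: alpha * query + beta * centroid.
    combined = {
        t: alpha * q_vec.get(t, 0.0) + beta * centroid.get(t, 0.0)
        for t in set(q_vec) | set(centroid)
    }
    # Keep all original query terms plus the top-k new terms (ties broken alphabetically).
    expansion = sorted((t for t in combined if t not in q_vec), key=lambda t: (-combined[t], t))[:top_k]
    return {t: combined[t] for t in list(q_vec) + expansion}
```

For example, `rocchio_expand("solar panel", ["solar panel efficiency drops in heat", "panel efficiency and heat"], top_k=2)` keeps `solar` and `panel` and adds `efficiency` and `heat` as expansion terms, with `panel` weighted above `solar` because it appears in both hypothetical documents.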
Hint-Augmented Re-ranking: Efficient Product Search using LLM-Based Query Decomposition [20.966359103135762]
We show that LLMs can uncover latent intent behind superlatives in e-commerce queries. Our approach decomposes queries into attribute-value hints generated concurrently with retrieval. Our method improves search performance by 10.9 points in MAP and ranking by 5.9 points in MRR over baselines.
arXiv Detail & Related papers (2025-11-17T23:53:25Z)
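The MAP and MRR figures quoted above are standard ranking metrics. A minimal sketch of their usual definitions follows, on hypothetical runs and relevance judgments; the paper's actual evaluation settings (cutoffs, graded relevance handling) are not given in the summary, and "points" presumably refers to these scores on a 0 to 100 scale.

```python
from statistics import mean


def average_precision(ranking: list[str], relevant: set[str]) -> float:
    # AP: sum of precision@i at each rank i holding a relevant doc,
    # divided by the total number of relevant docs (missed ones count as zero).
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, docid in enumerate(ranking, start=1):
        if docid in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)


def reciprocal_rank(ranking: list[str], relevant: set[str]) -> float:
    # RR: 1 / rank of the first relevant doc, or 0 if none is retrieved.
    for i, docid in enumerate(ranking, start=1):
        if docid in relevant:
            return 1.0 / i
    return 0.0


# Hypothetical runs and judgments for two queries.
runs = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d4"}}

map_score = mean(average_precision(runs[q], qrels[q]) for q in runs)
mrr_score = mean(reciprocal_rank(runs[q], qrels[q]) for q in runs)
```

Here q1 has AP = (1/2 + 2/3) / 2 = 7/12 and q2 has AP = 1/2, so MAP is 13/24; both queries find their first relevant document at rank 2, so MRR is 0.5.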
Rethinking On-policy Optimization for Query Augmentation [49.87723664806526]
We present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks. We introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which learns to generate a pseudo-document that maximizes retrieval performance.
arXiv Detail & Related papers (2025-10-20T04:16:28Z)
Reasoning-enhanced Query Understanding through Decomposition and Interpretation [87.56450566014625]
ReDI is a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. We compiled a large-scale dataset of real-world complex queries from a major search engine. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms.
arXiv Detail & Related papers (2025-09-08T10:58:42Z)
InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking [3.1125398490785217]
InsertRank is an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance. With DeepSeek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark and 51.1 on the R2MED benchmark, surpassing previous methods.
arXiv Detail & Related papers (2025-06-17T01:04:45Z)
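Listwise LLM rerankers, such as those in RankLLM and the listwise setting InsertRank targets, usually cannot fit every candidate into one prompt, so a common strategy is a sliding window that moves from the bottom of the candidate list to the top, letting strong documents "bubble up". The sketch below assumes a hypothetical `permute(query, docs)` callable standing in for the LLM call (here a toy term-overlap scorer); window 20 and step 10 are commonly used values, not settings taken from these papers. InsertRank's addition of BM25 scores to the prompt is not modeled.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for an LLM call: returns a permutation of window indices.
Permuter = Callable[[str, Sequence[str]], list[int]]


def overlap_permute(query: str, docs: Sequence[str]) -> list[int]:
    # Toy "reranker": order by query-term overlap (stable for ties).
    q = set(query.split())
    return sorted(range(len(docs)), key=lambda i: -len(q & set(docs[i].split())))


def sliding_window_rerank(
    query: str, candidates: Sequence[str], permute: Permuter, window: int = 20, step: int = 10
) -> list[str]:
    ranked = list(candidates)
    end = len(ranked)
    start = max(0, end - window)
    while True:
        # Rerank the current window in place, then slide toward the top.
        window_docs = ranked[start:end]
        order = permute(query, window_docs)
        ranked[start:end] = [window_docs[i] for i in order]
        if start == 0:
            break
        end -= step
        start = max(0, end - window)
    return ranked
```

With `window=3, step=2`, a relevant document that starts at the bottom of five candidates moves up within the first window and reaches the top in the second; this bottom-up direction is why a single pass can promote a document from deep in the list.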
Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion [0.0]
Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers.
arXiv Detail & Related papers (2025-06-05T08:44:34Z)
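Rank fusion, as in the fusion baselines of the main abstract and the Exp4Fuse framework above, combines several ranked lists into one. A widely used technique is reciprocal rank fusion (RRF), where each document scores the sum of 1 / (k + rank) over the runs that retrieve it, with k = 60 as the common default. The sketch below is generic RRF; whether Exp4Fuse or the BRIGHT fusion baselines use exactly this formula is not stated in these summaries.

```python
from collections import defaultdict


def reciprocal_rank_fusion(runs: list[list[str]], k: int = 60) -> list[str]:
    # Each run is a ranked list of doc ids (best first).
    scores: dict[str, float] = defaultdict(float)
    for run in runs:
        for rank, docid in enumerate(run, start=1):
            scores[docid] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For instance, fusing a sparse run `["a", "b", "c"]` with a dense run `["b", "c", "a"]` ranks `b` first, since it sits near the top of both lists. Because RRF uses only ranks, it sidesteps the problem of sparse and dense scores living on incomparable scales.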
IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation [24.108631023133704]
IterKey is an iterative keyword generation framework that enhances RAG via sparse retrieval. It achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines.
arXiv Detail & Related papers (2025-05-13T11:25:15Z)
Data Fusion of Synthetic Query Variants With Generative Large Language Models [1.864807003137943]
This work explores the feasibility of using synthetic query variants generated by instruction-tuned Large Language Models in data fusion experiments. We introduce a lightweight, unsupervised, and cost-efficient approach that exploits principled prompting and data fusion techniques. Our analysis shows that data fusion based on synthetic query variants is significantly better than baselines with single queries and also outperforms pseudo-relevance feedback methods.
arXiv Detail & Related papers (2024-11-06T12:54:27Z)
BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding. We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points.
arXiv Detail & Related papers (2024-07-16T17:58:27Z)
Large Language Models are Strong Zero-Shot Retriever [89.16756291653371]
We propose a simple method that applies a large language model (LLM) to large-scale retrieval in zero-shot scenarios. Our method, the Language Model as Retriever (LameR), is built on no neural model other than an LLM.
arXiv Detail & Related papers (2023-04-27T14:45:55Z)
Query2doc: Query Expansion with Large Language Models [69.9707552694766]
The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs). query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets. The method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
arXiv Detail & Related papers (2023-03-14T07:27:30Z)
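Query2doc-style expansion appends an LLM-generated pseudo-document to the original query. A minimal sketch follows, assuming the pseudo-document is already generated (the few-shot LLM call is not shown) and assuming, per our reading of the query2doc paper, that for BM25 the original query is repeated n = 5 times so its terms are not swamped by the longer pseudo-document, while for dense retrievers the two are joined with a separator. The function names are hypothetical.

```python
def expand_for_sparse(query: str, pseudo_doc: str, n: int = 5) -> str:
    # Repeat the query n times so that, under bag-of-words BM25 query weighting,
    # the original terms keep n-fold weight relative to the pseudo-document.
    return " ".join([query] * n + [pseudo_doc])


def expand_for_dense(query: str, pseudo_doc: str, sep: str = "[SEP]") -> str:
    # Dense retrievers: a single copy of the query, then a separator and the pseudo-document.
    return f"{query} {sep} {pseudo_doc}"
```

The repetition trick relies on bag-of-words query construction, where query term frequency scales a term's contribution linearly; with a saturating query-side BM25 weighting, repeating the query would presumably give a smaller boost.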

This list is automatically generated from the titles and abstracts of the papers on this site.