Overview of the TREC 2023 deep learning track
- URL: http://arxiv.org/abs/2507.08890v1
- Date: Thu, 10 Jul 2025 20:39:42 GMT
- Title: Overview of the TREC 2023 deep learning track
- Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, Ian Soboroff
- Abstract summary: This is the fifth year of the TREC Deep Learning track. We leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available. This year we generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt.
- Score: 67.56975103581688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year's design, to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document set, with passage ranking as the primary task and document ranking as a secondary task (using labels inferred from passage). As we did last year, we sample from MS MARCO queries that were completely held out and unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual human-written MS MARCO queries, this year we also generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt. The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the "nnlm" approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just the human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with system ordering agreement of $\tau=0.8487$. However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 were favored when evaluated using synthetic GPT-4 queries, or where runs using T5 were favored when evaluated on synthetic T5 queries.
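The reported agreement of $\tau=0.8487$ is a Kendall rank correlation between the orderings that two evaluations induce over the submitted runs. As a rough illustration only (not the track's evaluation code), the Python sketch below compares how a set of hypothetical runs would be ordered by NDCG@10 on human queries versus synthetic queries; the run names and scores are invented for the example.

```python
from scipy.stats import kendalltau

# Hypothetical per-run effectiveness (e.g., NDCG@10) under the two query sets.
# Run names and scores are made up purely for illustration.
ndcg_human = {"runA": 0.72, "runB": 0.69, "runC": 0.65, "runD": 0.58, "runE": 0.51}
ndcg_synthetic = {"runA": 0.70, "runB": 0.71, "runC": 0.62, "runD": 0.55, "runE": 0.50}

runs = sorted(ndcg_human)                             # fix a common run order
human_scores = [ndcg_human[r] for r in runs]          # scores on human queries
synthetic_scores = [ndcg_synthetic[r] for r in runs]  # scores on synthetic queries

tau, p_value = kendalltau(human_scores, synthetic_scores)
print(f"Kendall's tau = {tau:.4f} (p = {p_value:.3f})")
```

A tau of 1.0 would mean the two query sets order every pair of runs identically; a value near 0.85, as reported above, indicates strong but not perfect agreement between evaluation on human and synthetic queries.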
Related papers
- Overview of the TREC 2022 deep learning track [67.86242254073656]
This is the fourth year of the TREC Deep Learning track. We leverage the MS MARCO datasets that made hundreds of thousands of human annotated training labels available. Similar to previous years, deep neural ranking models that employ large scale pretraining continued to outperform traditional retrieval methods.
arXiv Detail & Related papers (2025-07-10T20:48:22Z)
- ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge [49.65993318863458]
ImpliRet is a benchmark that shifts the reasoning challenge to document-side processing. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting.
arXiv Detail & Related papers (2025-06-17T11:08:29Z)
- Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning [76.50690734636477]
We introduce Rank-R1, a novel LLM-based reranker that performs reasoning over both the user query and candidate documents before performing the ranking task. Our experiments on the TREC DL and BRIGHT datasets show that Rank-R1 is highly effective, especially for complex queries.
arXiv Detail & Related papers (2025-03-08T03:14:26Z)
- Evaluating LLMs on Entity Disambiguation in Tables [0.9786690381850356]
This work proposes an extensive evaluation of four state-of-the-art semantic table interpretation (STI) approaches: Alligator (formerly s-elbat), Dagobah, TURL, and TableLlama.
We also include in the evaluation both GPT-4o and GPT-4o-mini, since they excel in various public benchmarks.
arXiv Detail & Related papers (2024-08-12T18:01:50Z)
- An In-Context Learning Agent for Formal Theorem-Proving [10.657173216834668]
We present an in-context learning agent for formal theorem proving in environments like Lean and Coq (a rough sketch of the search loop follows this entry).
COPRA repeatedly asks a large language model to propose tactic applications from within a stateful backtracking search.
We evaluate our implementation of COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project.
arXiv Detail & Related papers (2023-10-06T16:21:22Z)
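The COPRA summary above describes a stateful backtracking search in which an LLM repeatedly proposes tactic applications. The sketch below is a minimal, generic version of such a loop, assuming placeholder helpers `propose_tactics` (the LLM call) and `apply_tactic` (the proof-assistant interface); these names and the termination check are illustrative and not taken from the COPRA implementation.

```python
from typing import List, Optional

def propose_tactics(state: str, history: List[str]) -> List[str]:
    """Placeholder: ask an LLM for candidate tactics given the proof state and history."""
    raise NotImplementedError

def apply_tactic(state: str, tactic: str) -> Optional[str]:
    """Placeholder: run the tactic in the proof assistant; return the new state, or None on failure."""
    raise NotImplementedError

def search(state: str, history: List[str], depth: int = 0, max_depth: int = 30) -> Optional[List[str]]:
    """Depth-first backtracking search over LLM-proposed tactics."""
    if state == "QED":                    # proof finished (illustrative termination check)
        return history
    if depth >= max_depth:                # budget exhausted on this branch
        return None
    for tactic in propose_tactics(state, history):
        new_state = apply_tactic(state, tactic)
        if new_state is None:             # tactic failed; try the next proposal
            continue
        proof = search(new_state, history + [tactic], depth + 1, max_depth)
        if proof is not None:             # a completed proof propagates back up
            return proof
    return None                           # all proposals exhausted: backtrack
```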
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting [65.00288634420812]
Pairwise Ranking Prompting (PRP) is a technique that significantly reduces the burden on Large Language Models (LLMs) used for text ranking (a sketch of one pairwise aggregation strategy follows this entry).
Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs.
arXiv Detail & Related papers (2023-06-30T11:32:25Z)
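To make the pairwise idea concrete, here is a minimal sketch of one way pairwise comparisons can be aggregated into a full ranking: the LLM compares every pair of candidate documents, and documents are sorted by the number of comparisons they win. The `llm_prefers` helper is a placeholder for the actual prompt and model call, and this all-pairs aggregation is only one possible strategy, not the paper's exact implementation.

```python
from itertools import combinations
from typing import List

def llm_prefers(query: str, doc_a: str, doc_b: str) -> bool:
    """Placeholder: prompt an LLM with the query and two passages and
    return True if it judges doc_a more relevant than doc_b."""
    raise NotImplementedError

def prp_rank(query: str, docs: List[str]) -> List[str]:
    """Rank documents by the number of pairwise comparisons they win."""
    wins = {doc: 0 for doc in docs}
    for doc_a, doc_b in combinations(docs, 2):
        # Query the model in both orders to reduce prompt-position bias.
        if llm_prefers(query, doc_a, doc_b):
            wins[doc_a] += 1
        if llm_prefers(query, doc_b, doc_a):
            wins[doc_b] += 1
    return sorted(docs, key=lambda d: wins[d], reverse=True)
```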
- T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
- Integrating Rankings into Quantized Scores in Peer Review [61.27794774537103]
In peer review, reviewers are usually asked to provide scores for the papers, and these quantized scores convey only coarse information. To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed.
There is no standard procedure for using this ranking information, and Area Chairs may use it in different ways.
We take a principled approach to integrate the ranking information into the scores.
arXiv Detail & Related papers (2022-04-05T19:39:13Z)
- Multi-Narrative Semantic Overlap Task: Evaluation and Benchmark [4.303515688770516]
This paper introduces an important yet relatively unexplored NLP task called Multi-Narrative Semantic Overlap (MNSO).
We created a benchmark dataset by crawling 2,925 narrative pairs from the web and then going through the tedious process of manually creating 411 different ground-truth semantic overlaps with human annotators.
We formulate a new precision-recall style evaluation metric called SEM-F1 (semantic F1); a generic sketch of this style of metric follows this entry.
Experimental results show that the proposed SEM-F1 metric yields higher correlation with human judgement as well as higher inter-rater agreement compared to the ROUGE metric.
arXiv Detail & Related papers (2022-01-14T03:56:41Z)
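The summary above describes SEM-F1 only as a precision-recall style semantic metric, so the sketch below shows a generic way such a metric can be assembled, not the paper's actual definition: precision averages each predicted sentence's best semantic similarity against the reference overlap, recall does the reverse, and the two are combined with a harmonic mean. The `similarity` callable (for example, cosine similarity of sentence embeddings) is an assumed placeholder.

```python
from typing import Callable, List

def semantic_f1(pred_sents: List[str], ref_sents: List[str],
                similarity: Callable[[str, str], float]) -> float:
    """Generic semantic F1 sketch: best-match precision/recall combined by a harmonic mean."""
    if not pred_sents or not ref_sents:
        return 0.0
    # Precision: how well each predicted sentence is covered by the reference overlap.
    precision = sum(max(similarity(p, r) for r in ref_sents) for p in pred_sents) / len(pred_sents)
    # Recall: how well each reference sentence is covered by the prediction.
    recall = sum(max(similarity(r, p) for p in pred_sents) for r in ref_sents) / len(ref_sents)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```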
- MS MARCO: Benchmarking Ranking Models in the Large-Data Regime [57.37239054770001]
This paper uses MS MARCO and the TREC Deep Learning Track as a case study.
We show how the design of the evaluation effort can encourage or discourage certain outcomes.
We provide some analysis of certain pitfalls and a statement of best practices for avoiding them.
arXiv Detail & Related papers (2021-05-09T20:57:36Z)
- Brown University at TREC Deep Learning 2019 [11.63256359906015]
This paper describes Brown University's submission to the TREC 2019 Deep Learning track.
Brown's team ranked 3rd in the passage retrieval task (including full ranking and re-ranking), and 2nd when considering only re-ranking submissions.
arXiv Detail & Related papers (2020-09-08T22:54:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.