MS MARCO: Benchmarking Ranking Models in the Large-Data Regime
- URL: http://arxiv.org/abs/2105.04021v1
- Date: Sun, 9 May 2021 20:57:36 GMT
- Title: MS MARCO: Benchmarking Ranking Models in the Large-Data Regime
- Authors: Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos and Jimmy
Lin
- Abstract summary: This paper uses MS MARCO and the TREC Deep Learning Track as a case study.
We show how the design of the evaluation effort can encourage or discourage certain outcomes.
We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls.
- Score: 57.37239054770001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public
leaderboards such as MS MARCO, are intended to encourage research and track our
progress, addressing big questions in our field. However, the goal is not
simply to identify which run is "best" and achieves the top score. The goal is to
move the field forward by developing robust new techniques that work in many
different settings and are adopted in research and practice. This paper uses
MS MARCO and the TREC Deep Learning Track as a case study, comparing them to
TREC ad hoc ranking in the 1990s. We show how the design of an
evaluation effort can encourage or discourage certain outcomes, and raise
questions about the internal and external validity of results. We provide some
analysis of certain pitfalls, and a statement of best practices for avoiding
such pitfalls. We summarize the progress of the effort so far, and describe our
desired end state of "robust usefulness", along with steps that might be
required to get us there.
Related papers
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [90.23629291067763]
A promising approach for improving reasoning in large language models is to use process reward models (PRMs).
PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs).
To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?"
We theoretically characterize the set of good provers, and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL.
arXiv Detail & Related papers (2024-10-10T17:31:23Z) - Goal-Reaching Policy Learning from Non-Expert Observations via Effective Subgoal Guidance [46.8322564551124]
We propose a novel subgoal guidance learning strategy.
We develop a diffusion strategy-based high-level policy to generate reasonable subgoals as waypoints.
We evaluate our method on complex robotic navigation and manipulation tasks.
arXiv Detail & Related papers (2024-09-06T02:49:12Z) - Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation [3.5490824406092405]
Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications.
The current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols.
arXiv Detail & Related papers (2024-06-13T12:54:29Z) - A Survey on Deep Active Learning: Recent Advances and New Frontiers [27.07154361976248]
Deep learning-based active learning (DAL) has gained increasing popularity due to its broad applicability, yet survey papers on the topic remain scarce.
This work aims to serve as a useful and quick guide for researchers in overcoming difficulties in DAL.
arXiv Detail & Related papers (2024-05-01T05:54:33Z) - How much can change in a year? Revisiting Evaluation in Multi-Agent
Reinforcement Learning [4.653136482223517]
We extend a previously published database of evaluation methodology containing meta-data on MARL publications from top-rated conferences.
We compare the findings extracted from this updated database to the trends identified in the original work.
We do observe a trend towards more difficult scenarios in SMAC-v1, which, if continued into SMAC-v2, will encourage novel algorithmic development.
arXiv Detail & Related papers (2023-12-13T19:06:34Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks, and we observe similarly improved performance on code generation tasks.
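As an illustration only, the following is a minimal sketch of what greedy step-level search guided by a process reward model might look like; it is not the paper's implementation, and `generate_candidate_steps` and `prm_score` are hypothetical stand-ins for an LLM sampler and a trained PRM.

```python
# Hedged sketch (assumptions, not the paper's code): greedy step-level decoding
# where a process reward model (PRM) scores each candidate next step.
from typing import Callable, List

def greedy_prm_search(prompt: str,
                      generate_candidate_steps: Callable[[str], List[str]],
                      prm_score: Callable[[str, str], float],
                      max_steps: int = 10,
                      stop_token: str = "[ANSWER]") -> str:
    """At each step, sample several candidate reasoning steps and keep the one
    the PRM scores highest, until an answer step is produced."""
    trace = prompt
    for _ in range(max_steps):
        candidates = generate_candidate_steps(trace)  # e.g. k sampled continuations
        best = max(candidates, key=lambda step: prm_score(trace, step))
        trace = trace + "\n" + best
        if stop_token in best:
            break
    return trace
```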
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
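For reference, the ERASER comprehensiveness and sufficiency metrics are defined as drops in predicted-class probability under input erasure. The sketch below illustrates those standard definitions, not the paper's code; `model_prob` is a hypothetical stand-in for the classifier's predicted-class probability on a token list.

```python
# Hedged sketch of ERASER-style faithfulness metrics, assuming a classifier
# wrapped as `model_prob(tokens) -> probability of the originally predicted class`.
from typing import Callable, List, Set

def comprehensiveness(model_prob: Callable[[List[str]], float],
                      tokens: List[str], rationale_idx: Set[int]) -> float:
    """Probability drop when rationale tokens are removed; high values suggest
    the rationale was genuinely needed for the prediction."""
    without_rationale = [t for i, t in enumerate(tokens) if i not in rationale_idx]
    return model_prob(tokens) - model_prob(without_rationale)

def sufficiency(model_prob: Callable[[List[str]], float],
                tokens: List[str], rationale_idx: Set[int]) -> float:
    """Probability drop when only rationale tokens are kept; low values suggest
    the rationale alone supports the original prediction."""
    only_rationale = [t for i, t in enumerate(tokens) if i in rationale_idx]
    return model_prob(tokens) - model_prob(only_rationale)
```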
arXiv Detail & Related papers (2023-08-28T03:03:03Z) - Towards a Standardised Performance Evaluation Protocol for Cooperative
MARL [2.2977300225306583]
Multi-agent reinforcement learning (MARL) has emerged as a useful approach to solving decentralised decision-making problems at scale.
We take a closer look at this rapid development with a focus on evaluation methodologies employed across a large body of research in cooperative MARL.
We propose a standardised performance evaluation protocol for cooperative MARL.
arXiv Detail & Related papers (2022-09-21T16:40:03Z) - Integrating Rankings into Quantized Scores in Peer Review [61.27794774537103]
In peer review, reviewers are usually asked to provide scores for the papers.
To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed.
There is no standard procedure for using this ranking information, and Area Chairs may use it in different ways.
We take a principled approach to integrate the ranking information into the scores.
arXiv Detail & Related papers (2022-04-05T19:39:13Z) - MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven
Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.