Rank-without-GPT: Building GPT-Independent Listwise Rerankers on
Open-Source Large Language Models
- URL: http://arxiv.org/abs/2312.02969v1
- Date: Tue, 5 Dec 2023 18:57:40 GMT
- Title: Rank-without-GPT: Building GPT-Independent Listwise Rerankers on
Open-Source Large Language Models
- Authors: Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang,
Jimmy Lin
- Abstract summary: Listwise rerankers based on large language models (LLM) are the zero-shot state-of-the-art.
In this work, we build for the first time effective listwise rerankers without any form of dependency on GPT.
Our best listwise reranker surpasses the listwise rerankers based on GPT-3.5 by 13% and achieves 97% effectiveness of the ones built on GPT-4.
- Score: 59.52207546810294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Listwise rerankers based on large language models (LLM) are the zero-shot
state-of-the-art. However, current works in this direction all depend on GPT
models, making them a single point of failure in scientific reproducibility.
Moreover, it raises the concern that current research findings only hold for
GPT models but not LLMs in general. In this work, we lift this precondition
and build for the first time effective listwise rerankers without any form of
dependency on GPT. Our passage retrieval experiments show that our best listwise
reranker surpasses the listwise rerankers based on GPT-3.5 by 13% and achieves
97% effectiveness of the ones built on GPT-4. Our results also show that the
existing training datasets, which were expressly constructed for pointwise
ranking, are insufficient for building such listwise rerankers. Instead,
high-quality listwise ranking data is required and crucial, calling for further
work on building human-annotated listwise data resources.
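Concretely, a listwise reranker sees the query and all candidate passages in one prompt and generates an ordering over passage identifiers, in contrast to a pointwise reranker that scores each passage in isolation. Below is a minimal sketch of that interface, assuming a hypothetical `generate()` wrapper around an open-source instruction-following model; the prompt wording and output parsing are illustrative, not the paper's exact template.

```python
# A minimal sketch of listwise reranking with an open-source LLM.
# Assumption: `generate(prompt) -> str` wraps some instruction-following
# model; the prompt wording below is illustrative, not the paper's template.
import re
from typing import Callable, List

def listwise_rerank(query: str, passages: List[str],
                    generate: Callable[[str], str]) -> List[str]:
    # Present every candidate in a single prompt, each tagged with an index.
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank the following passages by their relevance to the query.\n"
        f"Query: {query}\n\n{numbered}\n\n"
        "Answer with the passage numbers in decreasing order of relevance, "
        "e.g. [2] > [3] > [1]."
    )
    completion = generate(prompt)
    # Parse the generated permutation; indices the model omitted (or
    # hallucinated out of range) fall back to their original positions.
    ranked = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", completion)]
    seen, order = set(), []
    for i in ranked + list(range(len(passages))):
        if 0 <= i < len(passages) and i not in seen:
            seen.add(i)
            order.append(i)
    return [passages[i] for i in order]
```

The repair step in the parser is not incidental: unlike a pointwise scorer, a listwise model can emit malformed or incomplete permutations, so the caller must recover a valid ordering from whatever it generates.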
Related papers
- Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance [11.595274304409937]
Large language models (LLMs) have revolutionized zero-shot task performance.
Current methods using trigger phrases such as "Let's think step by step" remain limited.
This study introduces PRomPTed, an approach that optimizes the zero-shot prompts for individual task instances.
arXiv Detail & Related papers (2023-10-03T14:51:34Z)
- GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond [29.778018058541676]
GPT-Fathom is an open-source and reproducible evaluation suite for large language models (LLMs) built on top of OpenAI Evals.
We evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings.
arXiv Detail & Related papers (2023-09-28T16:43:35Z)
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models [56.51705482912727]
We present RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting.
Experimental results on the TREC 2019 and 2020 Deep Learning Tracks show that we can achieve effectiveness comparable to zero-shot reranking with GPT-3.5 with a much smaller 7B parameter model, although our effectiveness remains slightly behind reranking with GPT-4.
arXiv Detail & Related papers (2023-09-26T17:31:57Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- GPT-RE: In-context Learning for Relation Extraction using Large Language Models [43.968903620208444]
GPT-RE bridges the gap between large language models and fully-supervised baselines in relation extraction.
We evaluate GPT-RE on four widely-used RE datasets and observe that GPT-RE achieves improvements over existing GPT-3 baselines (a minimal sketch of this in-context setup appears after this list).
arXiv Detail & Related papers (2023-05-03T13:28:08Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
- Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study [115.96080028033904]
We study a scalable pre-trained retrieval-augmented LM (i.e., RETRO), comparing it with standard GPT and retrieval-augmented GPT.
Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models.
arXiv Detail & Related papers (2023-04-13T18:04:19Z)
- Instruction Tuning with GPT-4 [107.55078894215798]
We present the first attempt to use GPT-4 to generate instruction-following data for finetuning large language models.
Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks.
arXiv Detail & Related papers (2023-04-06T17:58:09Z)
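As referenced in the GPT-RE entry above, in-context relation extraction amounts to assembling a few-shot prompt from labeled demonstrations and reading the relation off the completion. The sketch below is one plausible shape for that prompt, reusing the same hypothetical `generate()` wrapper as before; the demonstration format is an illustration, not GPT-RE's exact recipe.

```python
# A minimal sketch of few-shot in-context relation extraction.
# Assumption: `generate(prompt) -> str` wraps an LLM; the demonstration
# format below is illustrative, not GPT-RE's exact recipe.
from typing import Callable, List, Tuple

def extract_relation(sentence: str, head: str, tail: str,
                     demos: List[Tuple[str, str, str, str]],
                     generate: Callable[[str], str]) -> str:
    blocks = ["Identify the relation between the two entities in each sentence."]
    # Each demonstration is (sentence, head entity, tail entity, relation).
    for s, h, t, rel in demos:
        blocks.append(f"Sentence: {s}\nEntities: {h}; {t}\nRelation: {rel}")
    # The test instance ends with an open "Relation:" slot for the model.
    blocks.append(f"Sentence: {sentence}\nEntities: {head}; {tail}\nRelation:")
    completion = generate("\n\n".join(blocks)).strip()
    # Keep only the first line in case the model keeps generating.
    return completion.splitlines()[0] if completion else ""
```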
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.