Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
- URL: http://arxiv.org/abs/2304.09542v2
- Date: Fri, 27 Oct 2023 12:11:16 GMT
- Title: Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
- Authors: Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, Zhaochun Ren
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
- Score: 56.104476412839944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable zero-shot
generalization across various language-related tasks, including search engines.
However, existing work utilizes the generative ability of LLMs for Information
Retrieval (IR) rather than direct passage ranking. The discrepancy between the
pre-training objectives of LLMs and the ranking objective poses another
challenge. In this paper, we first investigate generative LLMs such as ChatGPT
and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal
that properly instructed LLMs can deliver results competitive with, or even superior to,
state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to
address concerns about data contamination of LLMs, we collect a new test set
called NovelEval, based on the latest knowledge and aiming to verify the
model's ability to rank unknown knowledge. Finally, to improve efficiency in
real-world applications, we delve into the potential for distilling the ranking
capabilities of ChatGPT into small specialized models using a permutation
distillation scheme. Our evaluation results show that a distilled 440M
model outperforms a 3B supervised model on the BEIR benchmark. The code to
reproduce our results is available at www.github.com/sunnweiwei/RankGPT.
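To make the re-ranking setup concrete, here is a minimal sketch of the listwise, permutation-generation style of prompting the paper investigates. It assumes an OpenAI-style chat-completions client; the prompt template, model name, and output parsing are illustrative stand-ins, not the exact RankGPT implementation.

```python
import re
from openai import OpenAI  # assumes the official `openai` v1 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def permutation_rerank(query: str, passages: list[str], model: str = "gpt-4") -> list[int]:
    """Ask the LLM for a relevance permutation over candidate passages.

    Returns passage indices ordered from most to least relevant. The prompt
    below is an illustrative stand-in for the paper's instruction template.
    """
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Rank the following {len(passages)} passages by their relevance to the query.\n"
        f"Query: {query}\n\nPassages:\n{numbered}\n\n"
        "Answer with only the ranked identifiers, e.g. [2] > [1] > [3]."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Parse identifiers in the order the model emitted them, dropping
    # duplicates and out-of-range ids, then append anything it omitted.
    seen: set[int] = set()
    order: list[int] = []
    for m in re.findall(r"\[(\d+)\]", response.choices[0].message.content):
        i = int(m) - 1
        if 0 <= i < len(passages) and i not in seen:
            seen.add(i)
            order.append(i)
    order += [i for i in range(len(passages)) if i not in seen]
    return order
```

Candidate lists longer than the context window would have to be ranked in chunks (e.g., with a sliding window); the distilled student model discussed above avoids that inference cost entirely.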
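The permutation distillation idea can likewise be sketched as a pairwise (RankNet-style) objective that trains a small scoring model to reproduce the teacher LLM's permutation. This is a hedged illustration of the general technique, not necessarily the paper's exact loss; `student_scores` would come from a small ranker such as the 440M model mentioned above.

```python
import torch
import torch.nn.functional as F

def permutation_distillation_loss(
    student_scores: torch.Tensor,  # shape (n,): student's score per passage
    teacher_rank: list[int],       # the n passage indices, most relevant first
) -> torch.Tensor:
    """Pairwise (RankNet-style) loss: for every pair the teacher ordered,
    penalize the student whenever its score margin disagrees."""
    loss = student_scores.new_zeros(())
    pairs = 0
    for a in range(len(teacher_rank)):
        for b in range(a + 1, len(teacher_rank)):
            hi, lo = teacher_rank[a], teacher_rank[b]
            # -log(sigmoid(s_hi - s_lo)) == softplus(s_lo - s_hi)
            loss = loss + F.softplus(student_scores[lo] - student_scores[hi])
            pairs += 1
    return loss / max(pairs, 1)
```

Training would then loop over queries: score the candidates with the student, obtain the teacher permutation (e.g., via the re-ranking sketch above), and back-propagate this loss.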
Related papers
- Best Practices for Distilling Large Language Models into BERT for Web Search Ranking [14.550458167328497]
Large Language Models (LLMs) can generate a ranked list of potential documents.
We transfer the ranking expertise of LLMs to a more compact model like BERT, using a ranking loss to enable the deployment of less resource-intensive models.
Our model has been successfully integrated into a commercial web search engine as of February 2024.
arXiv Detail & Related papers (2024-11-07T08:54:46Z)
- Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators [22.567933207841968]
Large Language Models (LLMs) and AI assistants are experiencing exponential growth in usage among both expert and amateur users.
In this work, we focus on evaluating the reliability of current LLMs as science communicators.
We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts.
arXiv Detail & Related papers (2024-09-21T06:48:32Z)
- RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs [60.38044044203333]
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG).
We propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG.
For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks.
arXiv Detail & Related papers (2024-07-02T17:59:17Z)
- Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models [5.0490573482829335]
Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities.
This paper investigates the use of a pre-filtering step before passage re-ranking in information retrieval (IR).
Our experiments show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task.
arXiv Detail & Related papers (2024-06-26T20:12:24Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that, when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boost performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Evaluating Large Language Models for Health-Related Text Classification Tasks with Public Social Media Data [3.9459077974367833]
Large language models (LLMs) have demonstrated remarkable success in NLP tasks.
We benchmarked one supervised classic machine learning model based on Support Vector Machines (SVMs), three supervised pretrained language models (PLMs) based on RoBERTa, BERTweet, and SocBERT, and two LLM-based classifiers (GPT-3.5 and GPT-4) across 6 text classification tasks.
Our comprehensive experiments demonstrate that employing data augmentation with LLMs (GPT-4) alongside relatively little human-annotated data to train lightweight supervised classification models achieves superior results compared to training with human-annotated data alone.
arXiv Detail & Related papers (2024-03-27T22:05:10Z)
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models [69.51130760097818]
We propose Zooter, a reward-guided routing method distilling rewards on training queries to train a routing function.
We evaluate Zooter on a comprehensive benchmark collection with 26 subsets on different domains and tasks.
arXiv Detail & Related papers (2023-11-15T04:40:43Z)
- Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Large Language Models are Zero-Shot Rankers for Recommender Systems [76.02500186203929]
This work aims to investigate the capacity of large language models (LLMs) to act as the ranking model for recommender systems.
We show that LLMs have promising zero-shot ranking abilities but struggle to perceive the order of historical interactions.
We demonstrate that these issues can be alleviated using specially designed prompting and bootstrapping strategies.
arXiv Detail & Related papers (2023-05-15T17:57:39Z)