Multi-stage Large Language Model Pipelines Can Outperform GPT-4o in Relevance Assessment
- URL: http://arxiv.org/abs/2501.14296v1
- Date: Fri, 24 Jan 2025 07:33:39 GMT
- Title: Multi-stage Large Language Model Pipelines Can Outperform GPT-4o in Relevance Assessment
- Authors: Julian A. Schnabel, Johanne R. Trippas, Falk Scholer, Danula Hettiachchi,
- Abstract summary: We propose a modular classification pipeline that divides the relevance assessment task into multiple stages.
One of our approaches showed an 18.4% increase in Krippendorff's $\alpha$ over OpenAI's GPT-4o mini.
- Score: 6.947361774195549
- Abstract: The effectiveness of search systems is evaluated using relevance labels that indicate the usefulness of documents for specific queries and users. While obtaining these relevance labels from real users is ideal, scaling such data collection is challenging. Consequently, third-party annotators are employed, but their inconsistent accuracy demands costly auditing, training, and monitoring. We propose an LLM-based modular classification pipeline that divides the relevance assessment task into multiple stages, each utilising different prompts and models of varying sizes and capabilities. Applied to TREC Deep Learning (TREC-DL), one of our approaches showed an 18.4% increase in Krippendorff's $\alpha$ over OpenAI's GPT-4o mini while maintaining a cost of about 0.2 USD per million input tokens, offering a more efficient and scalable solution for relevance assessment. This approach also beats the baseline performance of GPT-4o (5 USD per million input tokens). With a pipeline approach, even the $\alpha$ of the GPT-4o flagship model could be improved by 9.7%.
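The staged design described in the abstract (route each query-document pair through a cheap classifier first, escalating only uncertain cases to a stronger, more expensive model) can be sketched as follows. The stage names, confidence thresholds, and the keyword-overlap stand-ins for LLM calls are hypothetical illustrations for the general pattern, not the paper's actual prompts or models.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str  # hypothetical stage label, e.g. "cheap" or "flagship"
    # (query, doc) -> (relevance label, confidence in [0, 1])
    classify: Callable[[str, str], tuple[int, float]]
    threshold: float  # escalate to the next stage if confidence falls below this

def assess_relevance(query: str, doc: str, stages: list[Stage]) -> int:
    """Run stages in order; return the first sufficiently confident label,
    falling back to the final stage's answer."""
    label = 0
    for stage in stages:
        label, conf = stage.classify(query, doc)
        if conf >= stage.threshold:
            return label
    return label

# Toy stand-ins for LLM calls: a keyword-overlap heuristic plays the role
# of the small model, and a stricter checker plays the large model.
def small_model(query: str, doc: str) -> tuple[int, float]:
    overlap = len(set(query.lower().split()) & set(doc.lower().split()))
    if overlap == 0:
        return 0, 0.9  # confidently non-relevant
    if overlap >= 3:
        return 1, 0.9  # confidently relevant
    return 1, 0.4      # uncertain: falls below threshold, so escalate

def large_model(query: str, doc: str) -> tuple[int, float]:
    overlap = len(set(query.lower().split()) & set(doc.lower().split()))
    return (1 if overlap >= 2 else 0), 1.0  # final stage always answers

pipeline = [Stage("cheap", small_model, 0.8),
            Stage("flagship", large_model, 0.0)]

print(assess_relevance("neural ranking models",
                       "neural ranking models survey", pipeline))
```

The cost saving comes from the routing: most pairs are settled by the cheap stage, and only the ambiguous middle band pays the flagship-model price.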
Related papers
- Label Privacy in Split Learning for Large Models with Parameter-Efficient Training [51.28799334394279]
We search for a way to fine-tune models over an API while keeping the labels private.
We propose P$^3$EFT, a multi-party split learning algorithm that takes advantage of existing PEFT properties to maintain privacy at a lower performance overhead.
arXiv Detail & Related papers (2024-12-21T15:32:03Z) - GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data [12.13180744190893]
GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale.
We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost.
arXiv Detail & Related papers (2024-10-03T17:58:29Z) - Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report) [6.789534723913505]
Large language models (LLMs) enable users to protect data privacy by eliminating the need to provide data to third parties.
We compare the performance of various language models on the Sustainable Development Goal mapping task.
According to the results of this study, LLaMA 2 and Gemma still have significant room for improvement.
arXiv Detail & Related papers (2024-08-05T03:05:02Z) - Toward Automatic Relevance Judgment using Vision--Language Models for Image--Text Retrieval Evaluation [56.49084589053732]
Vision--Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain.
This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion.
arXiv Detail & Related papers (2024-08-02T16:15:25Z) - In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss [4.8384738694883955]
We introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts.
Fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to $11 \times 10^6$ elements.
This achievement marks a substantial leap, as it is by far the longest input processed by any neural network model to date.
arXiv Detail & Related papers (2024-02-16T16:15:01Z) - ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z) - The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs.
GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system.
GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z) - Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - What Makes Good In-Context Examples for GPT-$3$? [101.99751777056314]
GPT-3 has attracted lots of attention due to its superior performance across a wide range of NLP tasks.
Despite its success, we found that the empirical results of GPT-3 depend heavily on the choice of in-context examples.
In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples.
arXiv Detail & Related papers (2021-01-17T23:38:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.