Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation
- URL: http://arxiv.org/abs/2409.11860v1
- Date: Wed, 18 Sep 2024 10:30:50 GMT
- Title: Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation
- Authors: Kasra Hosseini, Thomas Kober, Josip Krapac, Roland Vollgraf, Weiwei Cheng, Ana Peleteiro Ramallo,
- Abstract summary: Large Language Models (LLMs) have the potential to address this scaling issue.
We propose a framework for assessing the product search engines in a large-scale e-commerce setting.
Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations.
- Score: 3.670782697615276
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing the product search engines in a large-scale e-commerce setting, leveraging Multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations, significantly reduces time and cost, facilitates rapid problem discovery, and provides an effective solution for production-level quality control at scale.
Related papers
- Automated Query-Product Relevance Labeling using Large Language Models for E-commerce Search [3.392843594990172]
Traditional approaches for annotating query-product pairs rely on human-based labeling services.
We show that Large Language Models (LLMs) can approach human-level accuracy on this task in a fraction of the time and cost required by human-labelers.
This scalable alternative to human-annotation has significant implications for information retrieval domains.
arXiv Detail & Related papers (2025-02-21T22:59:36Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z) - From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research [13.818244562506138]
Large Language Models (LLMs) provide a cost-effective and efficient alternative to human annotation.
This paper introduces the SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow.
The workflow integrates established principles of human annotation with systematic prompt optimization and model selection.
arXiv Detail & Related papers (2024-12-19T02:21:41Z) - Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [64.1799100754406]
Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more.
Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines still remain inadequately explored in vision-language tasks.
We present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) an effective training pipeline to enhance the reasoning capabilities of MLLMs.
arXiv Detail & Related papers (2024-11-21T18:59:55Z) - Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z) - On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach [7.996010840316654]
We propose an uncertainty reduction framework using Large Language Models (LLMs) to improve entity resolution results.
LLMs capitalize on their advanced linguistic capabilities and a pay-as-you-go'' model that provides significant advantages to those without extensive data science expertise.
We show that our method is efficient and effective, offering promising applications in real-world tasks.
arXiv Detail & Related papers (2024-01-07T09:06:58Z) - DIALIGHT: Lightweight Multilingual Development and Evaluation of
Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.