Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements
- URL: http://arxiv.org/abs/2410.19974v1
- Date: Fri, 25 Oct 2024 21:29:04 GMT
- Title: Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements
- Authors: Silvia Terragni, Hoang Cuong, Joachim Daiber, Pallavi Gudipati, Pablo N. Mendes
- Abstract summary: Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators.
There is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases.
Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context.
- Score: 1.6637373649145606
- License:
- Abstract: Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.
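To make the cost-accuracy comparison concrete, here is a minimal sketch (not code from the paper) of how one might score candidate relevance judges against human labels while tracking an assumed per-judgment price; the model names, labels, and prices below are hypothetical placeholders.

```python
# Illustrative sketch only: compare hypothetical LLM relevance judgments against
# human labels using chance-corrected agreement, alongside an assumed cost.
from sklearn.metrics import cohen_kappa_score

human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # human relevance labels (1 = relevant)

candidates = {
    # model name: (predicted labels, assumed cost in USD per 1K judgments)
    "small-llm":  ([1, 0, 1, 0, 0, 1, 1, 0, 1, 1], 0.05),
    "large-mllm": ([1, 0, 1, 1, 0, 1, 0, 0, 1, 0], 1.20),
}

for name, (preds, cost_per_1k) in candidates.items():
    kappa = cohen_kappa_score(human, preds)  # agreement with human judges
    print(f"{name:12s}  kappa={kappa:.2f}  cost/1k=${cost_per_1k:.2f}")
```

In practice one would sweep this over each search scenario, since the paper's central finding is that the best cost-accuracy operating point changes with context.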
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting useful and relevant benchmarks from the large pool available, addressing the limited attention previous work has paid to benchmark utility.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [12.841405829775852]
We introduce the modality importance score (MIS) to identify bias in VidQA benchmarks and datasets.
We also propose an innovative method using state-of-the-art MLLMs to estimate the modality importance.
Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets.
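As an illustration of the general idea (the paper's MIS may be defined differently), a simple ablation-style importance estimate compares accuracy on the full input against accuracy when one modality is withheld; the evaluator and numbers below are stand-ins.

```python
# Hedged sketch of an ablation-style modality importance estimate.
def modality_importance(evaluate, modality):
    """Accuracy drop when `modality` is withheld from the input."""
    full = evaluate(drop=None)         # model sees both video and question text
    ablated = evaluate(drop=modality)  # same model, minus one modality
    return full - ablated              # large drop => the modality carries signal

def stub_evaluate(drop=None):
    """Stand-in for running a model on a hypothetical text-biased VidQA benchmark."""
    accuracy = {None: 0.74, "video": 0.71, "text": 0.33}
    return accuracy[drop]

print(modality_importance(stub_evaluate, "video"))  # ~0.03: video barely matters
print(modality_importance(stub_evaluate, "text"))   # ~0.41: answers lean on text
```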
arXiv Detail & Related papers (2024-08-22T23:32:42Z) - OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions and larger models that can utilize the original format.
arXiv Detail & Related papers (2024-06-12T17:37:09Z) - MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z) - Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models [56.256069117502385]
Chain of Thought (CoT) approaches can be used to enhance the capability of Large Language Models (LLMs) on complex reasoning tasks.
However, the selection of optimal CoT demonstration examples in multi-modal reasoning remains less explored.
We introduce a novel approach that addresses this challenge by using retrieval mechanisms to automatically select demonstration examples.
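A rough sketch of the general retrieval idea, under the assumption that demonstrations and queries are compared by embedding cosine similarity; `select_demonstrations`, the encoder, and the embeddings are illustrative, not the paper's implementation.

```python
import numpy as np

def select_demonstrations(query_emb, pool_embs, k=3):
    """Indices of the k candidate demonstrations most similar to the query."""
    pool = np.asarray(pool_embs, dtype=float)
    q = np.asarray(query_emb, dtype=float)
    sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:k]  # highest cosine similarity first

# Hypothetical embeddings: in practice these would come from a text or
# image-text encoder applied to the test question (and image) and the demo pool.
rng = np.random.default_rng(0)
demo_pool = rng.standard_normal((50, 128))
query = rng.standard_normal(128)
print(select_demonstrations(query, demo_pool, k=3))  # splice these into the CoT prompt
```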
arXiv Detail & Related papers (2023-12-04T08:07:21Z) - An Empirical Investigation into Benchmarking Model Multiplicity for Trustworthy Machine Learning: A Case Study on Image Classification [0.8702432681310401]
This paper offers a one-stop empirical benchmark of multiplicity across various dimensions of model design.
We also develop a framework, which we call multiplicity sheets, to benchmark multiplicity in various scenarios.
We show that multiplicity persists in deep learning models even after enforcing additional specifications during model selection.
arXiv Detail & Related papers (2023-11-24T22:30:38Z) - MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
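For intuition about why sparse MoE keeps compute comparable to a dense model while adding parameters, the toy sketch below routes each token to a single expert; it is a simplified stand-in, not the paper's architecture, and all weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # expert weights
gate_w = rng.standard_normal((d, n_experts))                       # router weights

def moe_layer(x):
    """Route each token to its single highest-scoring expert (top-1 gating)."""
    logits = x @ gate_w              # (tokens, n_experts) router scores
    choice = logits.argmax(axis=-1)  # top-1 expert per token
    out = np.empty_like(x)
    for e in range(n_experts):
        mask = choice == e
        out[mask] = x[mask] @ experts[e]  # only the chosen expert runs per token
    return out

tokens = rng.standard_normal((5, d))
y = moe_layer(tokens)  # each token paid for exactly one expert's compute
```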
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions [27.983902791798965]
We develop a model that generates dilution text that maintains relevance and topical coherence with the image and existing text.
We find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5%, respectively, in the presence of dilutions generated by our model.
Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations.
arXiv Detail & Related papers (2022-11-04T17:58:02Z)