Can We Predict Performance of Large Models across Vision-Language Tasks?
- URL: http://arxiv.org/abs/2410.10112v1
- Date: Mon, 14 Oct 2024 03:00:12 GMT
- Title: Can We Predict Performance of Large Models across Vision-Language Tasks?
- Authors: Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
- Abstract summary: We propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks.
We use a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset.
We demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
- Score: 34.27319941609499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating large vision-language models (LVLMs) is very expensive, due to the high computational costs and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, that is, predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, quickly reducing errors in performance prediction. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. In experiments, we systematically evaluate 108 LVLMs on 176 datasets from 36 benchmarks, constructing training and testing sets for validating our framework. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
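As a rough illustration of the matrix-completion formulation, here is a minimal NumPy sketch that fits a MAP point estimate of the PMF factors by gradient descent on the observed entries. The latent dimension, prior strength, step size, and the synthetic low-rank scores are all illustrative assumptions; the paper additionally samples the factors with MCMC to obtain uncertainty estimates.

```python
# Minimal sketch of performance-matrix completion via PMF (MAP estimate only).
# Sizes match the paper's 108 models x 176 datasets; k, lam, lr are assumed.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_datasets, k = 108, 176, 8

# Synthetic low-rank "performance matrix" standing in for real scores.
U_true = rng.standard_normal((n_models, k))
V_true = rng.standard_normal((n_datasets, k))
R = U_true @ V_true.T + 0.1 * rng.standard_normal((n_models, n_datasets))
mask = rng.random(R.shape) < 0.3                # ~30% of entries observed

U = 0.1 * rng.standard_normal((n_models, k))    # model factors
V = 0.1 * rng.standard_normal((n_datasets, k))  # dataset factors
lam, lr = 0.1, 1e-3                             # Gaussian-prior strength, step size

for _ in range(5000):
    E = mask * (U @ V.T - R)                    # residuals on observed entries only
    gU = E @ V + lam * U                        # gradients of the regularized squared error
    gV = E.T @ U + lam * V
    U -= lr * gU
    V -= lr * gV

R_hat = U @ V.T                                 # completed matrix: a prediction for every pair
rmse = np.sqrt(((R_hat - R)[~mask] ** 2).mean())
print(f"held-out RMSE: {rmse:.3f}")             # should approach the 0.1 noise level
```

Re-fitting over MCMC samples of $U$ and $V$ instead of a single point estimate gives a spread of predictions per entry, which is what the paper uses to decide which untested tasks to evaluate first.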
Related papers
- Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs [10.494477811252034]
Fine-tuning large language models can lead to *fine-tuning multiplicity*, where equally well-performing models make conflicting predictions on the same inputs.
This raises critical concerns about the robustness and reliability of Tabular LLMs.
This work proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining.
arXiv Detail & Related papers (2024-07-04T22:22:09Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
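The summary above pins UAL down to one concrete mechanism: per-sample label smoothing driven by uncertainty. A minimal PyTorch sketch of that idea, where the linear map from uncertainty to smoothing strength and the 0.2 cap are assumptions rather than the paper's exact schedule:

```python
# Sketch: per-sample adaptive label smoothing driven by an uncertainty score.
import torch
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainty, max_eps=0.2):
    """logits: (B, V); targets: (B,) long; uncertainty: (B,) in [0, 1]."""
    eps = max_eps * uncertainty                    # more uncertain -> smoother label
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, targets[:, None]).squeeze(1)
    uniform = -log_probs.mean(dim=-1)              # uniform-label component
    return ((1 - eps) * nll + eps * uniform).mean()

# Example: uncertainty could come from, e.g., normalized predictive entropy.
loss = ual_loss(torch.randn(4, 50000), torch.randint(0, 50000, (4,)), torch.rand(4))
```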
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
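The DPO update itself is standard; what the paper adds is step-level preference pairs mined with MCTS. A sketch of the loss, assuming the chosen/rejected step log-probabilities under the policy and a frozen reference model have already been computed from the search:

```python
# Sketch: the standard DPO objective applied to step-level preference pairs.
# The pairs are assumed to come from MCTS rollouts (e.g., sibling steps
# ranked by their search values).
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Each argument: (B,) summed log-probs of a step under policy / reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```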
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- Automated Efficient Estimation using Monte Carlo Efficient Influence Functions [5.1689445482852765]
This paper introduces *Monte Carlo Efficient Influence Functions* (MC-EIF).
MC-EIF is a fully automated technique for approximating efficient influence functions.
We prove that MC-EIF is consistent, and that estimators using MC-EIF achieve optimal $\sqrt{N}$ convergence rates.
arXiv Detail & Related papers (2024-02-29T22:19:46Z)
- Measuring the Driving Forces of Predictive Performance: Application to Credit Scoring [0.0]
In credit scoring, machine learning models are known to outperform standard parametric models.
We introduce the XPER methodology to decompose a performance metric into contributions associated with a model's features.
We show that a small number of features can explain a surprisingly large part of the model performance.
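XPER is Shapley-based; the sketch below decomposes a performance metric into per-feature Shapley contributions by exact enumeration, with "removing" a feature approximated by permuting its column. Both choices are illustrative simplifications feasible only for a handful of features, not the paper's estimator:

```python
# Sketch: Shapley-style decomposition of a performance metric into per-feature
# contributions. Exponential in the number of features; for small p only.
from itertools import combinations
from math import comb
import numpy as np

def _metric_on_subset(model, X, y, subset, metric, rng):
    """Metric with features outside `subset` neutralized by row permutation."""
    Xp = X.copy()
    off = [j for j in range(X.shape[1]) if j not in subset]
    if off:
        Xp[:, off] = rng.permutation(Xp[:, off], axis=0)
    return metric(y, model.predict(Xp))

def shapley_contributions(model, X, y, metric, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    phi = np.zeros(p)
    for j in range(p):
        for size in range(p):
            w = 1.0 / (p * comb(p - 1, size))        # Shapley weight for this coalition size
            for S in combinations([i for i in range(p) if i != j], size):
                v_with = _metric_on_subset(model, X, y, set(S) | {j}, metric, rng)
                v_without = _metric_on_subset(model, X, y, set(S), metric, rng)
                phi[j] += w * (v_with - v_without)   # marginal contribution of feature j
    return phi
```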
arXiv Detail & Related papers (2022-12-12T13:09:46Z)
- Useful Confidence Measures: Beyond the Max Score [9.189382034558657]
We derive several confidence measures that depend on information beyond the maximum score.
We show that when models are evaluated on out-of-distribution data "out of the box", using only the maximum score to inform the confidence measure is highly suboptimal.
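A sketch of two such measures alongside the max-score baseline; the specific choices (top-1/top-2 margin and negative entropy) are common examples rather than the paper's full set:

```python
# Sketch: confidence measures that use information beyond the maximum score.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def confidence_measures(logits):
    p = softmax(logits)
    max_score = p.max(axis=-1)                          # the usual baseline
    top2 = np.sort(p, axis=-1)[..., -2:]
    margin = top2[..., 1] - top2[..., 0]                # gap between top-1 and top-2
    neg_entropy = (p * np.log(p + 1e-12)).sum(axis=-1)  # higher = more confident
    return max_score, margin, neg_entropy
```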
arXiv Detail & Related papers (2022-10-25T14:54:44Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
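A minimal sketch of the first-order multitask step this describes: one batch per task, summed losses, a single optimizer step, and no inner loop or second-order terms. The model, batches, and loss function are placeholders:

```python
# Sketch: first-order multitask fine-tuning in the spirit of MAMF.
import torch

def mamf_step(model, optimizer, task_batches, loss_fn):
    """task_batches: one (x, y) batch per training task."""
    optimizer.zero_grad()
    total = 0.0
    for x, y in task_batches:
        loss = loss_fn(model(x), y)
        loss.backward()              # gradients accumulate across tasks; first-order only
        total += loss.item()
    optimizer.step()
    return total / len(task_batches)
```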
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
- Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression [51.770998056563094]
Probabilistic Gradient Boosting Machines (PGBM) is a method to create probabilistic predictions with a single ensemble of decision trees.
We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-03T08:32:13Z)
- Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
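For the calibration angle, one simple check is whether nominal prediction intervals achieve their empirical coverage. A sketch, assuming the performance predictor outputs a Gaussian mean and standard deviation per score (an assumption, not the paper's exact construction):

```python
# Sketch: empirical coverage of a performance predictor's confidence intervals.
import numpy as np
from scipy.stats import norm

def interval_coverage(y_true, y_pred, y_std, level=0.9):
    z = norm.ppf(0.5 + level / 2)                    # e.g., ~1.645 for a 90% interval
    lo, hi = y_pred - z * y_std, y_pred + z * y_std
    return np.mean((y_true >= lo) & (y_true <= hi))  # calibrated if close to `level`
```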
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
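A sketch of the transductive update this describes, with uniform weights over the top-k most confident queries standing in for the meta-learned weights, and an assumed 50/50 blend of old and refined prototypes:

```python
# Sketch: refine each class prototype with its most confident query embeddings.
import torch

def refine_prototypes(prototypes, queries, k=5):
    """prototypes: (C, D); queries: (Q, D) unlabeled query embeddings."""
    dists = torch.cdist(queries, prototypes)          # (Q, C) Euclidean distances
    conf, labels = torch.softmax(-dists, dim=1).max(dim=1)
    refined = prototypes.clone()
    for c in range(prototypes.size(0)):
        qc = conf * (labels == c)                     # confidence, zeroed for other classes
        top = qc.topk(min(k, queries.size(0))).indices
        top = top[qc[top] > 0]                        # keep only class-c queries
        if len(top) > 0:
            refined[c] = 0.5 * prototypes[c] + 0.5 * queries[top].mean(dim=0)
    return refined
```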
arXiv Detail & Related papers (2020-02-27T10:22:17Z)