Model Hubs and Beyond: Analyzing Model Popularity, Performance, and Documentation
- URL: http://arxiv.org/abs/2503.15222v2
- Date: Mon, 07 Apr 2025 11:25:35 GMT
- Title: Model Hubs and Beyond: Analyzing Model Popularity, Performance, and Documentation
- Authors: Pritam Kadasi, Sriman Reddy Kondam, Srivathsa Vamsi Chaturvedula, Rudranshu Sen, Agnish Saha, Soumavo Sikdar, Sayani Sarkar, Suhani Mittal, Rohit Jindal, Mayank Singh
- Abstract summary: We evaluate a comprehensive set of 500 Sentiment Analysis models on Hugging Face. Our findings reveal that model popularity does not necessarily correlate with performance. About 88% of model authors overstate their models' performance in the model cards.
- Score: 1.2888930658406668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the massive surge in ML models on platforms like Hugging Face, users often lose track and struggle to choose the best model for their downstream tasks, frequently relying on model popularity indicated by download counts, likes, or recency. We investigate whether this popularity aligns with actual model performance and how the comprehensiveness of model documentation correlates with both popularity and performance. In our study, we evaluated a comprehensive set of 500 Sentiment Analysis models on Hugging Face. This evaluation involved massive annotation efforts, with human annotators completing nearly 80,000 annotations, alongside extensive model training and evaluation. Our findings reveal that model popularity does not necessarily correlate with performance. Additionally, we identify critical inconsistencies in model card reporting: approximately 80% of the models analyzed lack detailed information about the model, training, and evaluation processes. Furthermore, about 88% of model authors overstate their models' performance in the model cards. Based on our findings, we provide a checklist of guidelines for users to choose good models for downstream tasks.
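As a rough illustration of the comparison the abstract describes (popularity signals such as download counts versus measured performance), the minimal Python sketch below pulls text-classification models from the Hugging Face Hub sorted by downloads and computes a Spearman rank correlation against evaluation scores. The `huggingface_hub` and `scipy` calls are standard client APIs, but the accuracy values are placeholders standing in for the paper's own large-scale evaluation; this is not the authors' pipeline.

```python
# Hedged sketch: compare Hub popularity (downloads) with measured accuracy.
# Assumes the huggingface_hub and scipy packages; the accuracy values below
# are placeholders, not results from the paper.
from huggingface_hub import HfApi
from scipy.stats import spearmanr

api = HfApi()

# Fetch a few text-classification models, sorted by download count (a popularity proxy).
models = list(api.list_models(task="text-classification", sort="downloads", limit=5))
popularity = {m.id: (m.downloads or 0) for m in models}

# Placeholder accuracies; in the study these would come from re-evaluating each
# model on a common sentiment-analysis test set with human-annotated labels.
measured_accuracy = {m_id: 0.80 - 0.01 * k for k, m_id in enumerate(popularity)}

ids = list(popularity)
rho, p_value = spearmanr(
    [popularity[i] for i in ids],
    [measured_accuracy[i] for i in ids],
)
print(f"Spearman correlation between downloads and accuracy: {rho:.2f} (p={p_value:.3f})")
```

In practice the placeholder accuracies would be replaced by scores from a shared test set, which is the kind of evaluation the paper carries out at scale before checking whether the ranking by popularity agrees with the ranking by performance.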
Related papers
- Model Provenance Testing for Large Language Models [14.949325775620439]
We develop a framework for testing model provenance: whether one model is derived from another. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models.
arXiv Detail & Related papers (2025-02-02T07:39:37Z) - CGI: Identifying Conditional Generative Models with Example Images [14.453885742032481]
Generative models have achieved remarkable performance recently, and thus model hubs have emerged. It is not easy for users to choose which model best meets their needs by reviewing model descriptions and example images. We propose Conditional Generative Model Identification (CGI), which aims to identify the most suitable model using user-provided example images.
arXiv Detail & Related papers (2025-01-23T09:31:06Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - What's documented in AI? Systematic Analysis of 32K AI Model Cards [40.170354637778345]
We conduct a comprehensive analysis of 32,111 AI model documentations on Hugging Face.
Most of the AI models with substantial downloads provide model cards, though the cards have uneven informativeness.
We find that sections addressing environmental impact, limitations, and evaluation exhibit the lowest filled-out rates, while the training section is the most consistently filled-out.
arXiv Detail & Related papers (2024-02-07T18:04:32Z) - Has Your Pretrained Model Improved? A Multi-head Posterior Based Approach [25.927323251675386]
We leverage the meta-features associated with each entity as a source of worldly knowledge and employ entity representations from the models.
We propose using the consistency between these representations and the meta-features as a metric for evaluating pre-trained models.
Our method's effectiveness is demonstrated across various domains, including models with relational datasets, large language models and image models.
arXiv Detail & Related papers (2024-01-02T17:08:26Z) - Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z) - Revealing the Underlying Patterns: Investigating Dataset Similarity, Performance, and Generalization [0.0]
Supervised deep learning models require a significant amount of labeled data to achieve acceptable performance on a specific task.
We establish image-image, dataset-dataset, and image-dataset distances to gain insights into the model's behavior.
arXiv Detail & Related papers (2023-08-07T13:35:53Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space (a generic weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate that there is no single model that works best for all cases.
By choosing an appropriate bias model, we can obtain better robustness than baselines that use more sophisticated model designs.
arXiv Detail & Related papers (2022-10-28T17:52:10Z) - CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing [83.63107444454938]
We propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of the bottom layers across all models and apply different perturbations to the hidden representations for different models, which effectively promotes model diversity.
Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model.
arXiv Detail & Related papers (2022-04-13T19:54:51Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
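The Dataless Knowledge Fusion entry above describes merging models in their parameter space; the sketch below shows the simplest form of that idea, element-wise weight averaging of two models with the same architecture. It is a generic illustration, not the specific fusion method proposed in that paper, and the checkpoint names are placeholders to be replaced with real fine-tuned models.

```python
# Generic parameter-space merging sketch: average the weights of two models
# that share an architecture. Illustrative only; not the cited paper's method.
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint names; substitute two fine-tuned models with identical architectures.
model_a = AutoModelForSequenceClassification.from_pretrained("model-a-placeholder")
model_b = AutoModelForSequenceClassification.from_pretrained("model-b-placeholder")

state_a, state_b = model_a.state_dict(), model_b.state_dict()
merged_state = {}
for name, tensor_a in state_a.items():
    tensor_b = state_b[name]
    if tensor_a.dtype.is_floating_point:
        merged_state[name] = (tensor_a + tensor_b) / 2.0  # average floating-point parameters
    else:
        merged_state[name] = tensor_a  # keep non-float buffers (e.g. integer indices) from model A

model_a.load_state_dict(merged_state)  # model_a now carries the merged weights
```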
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed details) and is not responsible for any consequences arising from its use.