Automated categorization of pre-trained models for software engineering: A case study with a Hugging Face dataset
- URL: http://arxiv.org/abs/2405.13185v1
- Date: Tue, 21 May 2024 20:26:17 GMT
- Title: Automated categorization of pre-trained models for software engineering: A case study with a Hugging Face dataset
- Authors: Claudio Di Sipio, Riccardo Rubei, Juri Di Rocco, Davide Di Ruscio, Phuong T. Nguyen
- Abstract summary: Software engineering (SE) activities have been revolutionized by the advent of pre-trained models (PTMs).
The Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models.
This paper introduces an approach to enable the automatic classification of PTMs for SE tasks.
- Score: 9.218130273952383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software engineering (SE) activities have been revolutionized by the advent of pre-trained models (PTMs), defined as large machine learning (ML) models that can be fine-tuned to perform specific SE tasks. However, users with limited expertise may need help to select the appropriate model for their current task. To tackle the issue, the Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models. Nevertheless, the platform currently lacks a comprehensive categorization of PTMs designed specifically for SE, i.e., the existing tags are more suited to generic ML categories. This paper introduces an approach to address this gap by enabling the automatic classification of PTMs for SE tasks. First, we utilize a public dump of HF to extract PTMs information, including model documentation and associated tags. Then, we employ a semi-automated method to identify SE tasks and their corresponding PTMs from existing literature. The approach involves creating an initial mapping between HF tags and specific SE tasks, using a similarity-based strategy to identify PTMs with relevant tags. The evaluation shows that model cards are informative enough to classify PTMs considering the pipeline tag. Moreover, we provide a mapping between SE tasks and stored PTMs by relying on model names.
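A minimal sketch of the tag-to-task idea described in the abstract, assuming the huggingface_hub client rather than the public HF dump used in the paper; the SE-task keyword list, the similarity threshold, and the helper best_se_task are illustrative placeholders, not the authors' semi-automated mapping:

# Illustrative sketch: classify Hub models into SE tasks via tag similarity.
# Assumptions: huggingface_hub is installed; SE_TASKS is a hand-crafted toy mapping.
from difflib import SequenceMatcher
from huggingface_hub import list_models

SE_TASKS = {
    "code generation": ["code-generation", "text-to-code"],
    "code summarization": ["code-summarization", "code-to-text"],
    "code completion": ["code-completion", "fill-mask-code"],
}

def best_se_task(tags, threshold=0.6):
    # Return the SE task whose keyword best matches any of the model's tags, if close enough.
    best_task, best_score = None, 0.0
    for task, keywords in SE_TASKS.items():
        for tag in tags:
            for keyword in keywords:
                score = SequenceMatcher(None, tag.lower(), keyword).ratio()
                if score > best_score:
                    best_task, best_score = task, score
    return best_task if best_score >= threshold else None

# Classify a small sample of models whose tags mention "code".
for model in list_models(filter="code", limit=20):
    task = best_se_task(model.tags or [])
    if task:
        print(f"{model.id}: pipeline_tag={model.pipeline_tag}, SE task ~ {task}")

The same similarity loop can be pointed at model names instead of tags to mirror, in spirit, the name-based mapping mentioned at the end of the abstract.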
Related papers
- PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software [6.243303627949341]
This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs.
The dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use.
Our analysis provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation.
arXiv Detail & Related papers (2024-02-01T15:55:50Z) - TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the amazing power of large language models (LLMs) to solve our task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as input sentences of textual modality.
We propose to conduct fine-grained feature-level alignment between ID-based models and Pretrained Language Models (FLIP) for CTR prediction.
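To make the modality contrast above concrete, here is a toy encoding example (not FLIP itself); the field vocabularies and the record are invented for illustration:

# Toy contrast between the two input modalities: one-hot ID features vs. a textual sentence.
FIELDS = {"user_id": ["u1", "u2", "u3"], "item_category": ["sports", "news", "music"]}

def one_hot_ids(record):
    # ID-based encoding: concatenate a one-hot vector per categorical field.
    vector = []
    for field, vocabulary in FIELDS.items():
        vector += [1 if record[field] == value else 0 for value in vocabulary]
    return vector

def textualize(record):
    # PLM-style encoding: render the same record as a sentence for a language model.
    return ", ".join(f"{field} is {value}" for field, value in record.items())

record = {"user_id": "u2", "item_category": "news"}
print(one_hot_ids(record))  # [0, 1, 0, 0, 1, 0]
print(textualize(record))   # user_id is u2, item_category is news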
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Naming Practices of Pre-Trained Models in Hugging Face [4.956536094440504]
Pre-Trained Models (PTMs) are adopted as components in computer systems.
Researchers publish PTMs, which engineers adapt for quality or performance prior to deployment.
Prior research has reported that model names are not always well chosen and are sometimes erroneous.
In this paper, we frame and conduct the first empirical investigation of PTM naming practices in the Hugging Face PTM registry.
arXiv Detail & Related papers (2023-10-02T21:13:32Z) - Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need [84.3507610522086]
Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones.
Recent pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL.
We argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring.
arXiv Detail & Related papers (2023-03-13T17:59:02Z) - A Deep Model for Partial Multi-Label Image Classification with Curriculum Based Disambiguation [42.0958430465578]
We study the partial multi-label (PML) image classification problem.
Existing PML methods typically design a disambiguation strategy to filter out noisy labels.
We propose a deep model for PML to enhance the representation and discrimination ability.
arXiv Detail & Related papers (2022-07-06T02:49:02Z) - AutoGeoLabel: Automated Label Generation for Geospatial Machine Learning [69.47585818994959]
We evaluate a big data processing pipeline to auto-generate labels for remote sensing data.
We utilize the big geo-data platform IBM PAIRS to dynamically generate such labels in dense urban areas.
arXiv Detail & Related papers (2022-01-31T20:02:22Z) - PLATINUM: Semi-Supervised Model Agnostic Meta-Learning using Submodular Mutual Information [3.1845305066053347]
Few-shot classification (FSC) requires training models using a few (typically one to five) data points per class.
We propose PLATINUM, a novel semi-supervised model agnostic meta-learning framework that uses the submodular mutual information (SMI) functions to boost the performance of FSC.
arXiv Detail & Related papers (2022-01-30T22:07:17Z) - Black-Box Tuning for Language-Model-as-a-Service [85.2210372920386]
This work proposes Black-Box Tuning to optimize PTMs through derivative-free algorithms.
In particular, we invoke CMA-ES to optimize the continuous prompt prepended to the input text by iteratively calling PTM inference APIs.
Our experimental results demonstrate that black-box tuning with RoBERTa on a few labeled samples not only significantly outperforms manual prompts and GPT-3's in-context learning, but also surpasses the gradient-based counterparts.
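A minimal sketch of the derivative-free loop summarized above, assuming the cma package; query_ptm_loss is a hypothetical stand-in for the real PTM inference API, and the random projection and dimensions are illustrative, not the paper's configuration:

# Sketch: CMA-ES searches a low-dimensional vector, projects it to a continuous prompt,
# and scores it only through black-box loss queries (here a toy quadratic placeholder).
import numpy as np
import cma  # pip install cma

D_PROMPT, D_LOW = 1024, 20             # prompt embedding size vs. search subspace
A = np.random.randn(D_PROMPT, D_LOW)   # fixed random projection into prompt space
target = np.random.randn(D_PROMPT)     # toy optimum standing in for the API's preferred prompt

def query_ptm_loss(prompt_embedding):
    # Placeholder for a PTM inference API call returning a task loss on a few labeled samples.
    return float(np.linalg.norm(prompt_embedding - target))

def objective(z):
    return query_ptm_loss(A @ np.asarray(z))

es = cma.CMAEvolutionStrategy(np.zeros(D_LOW), 0.5, {"maxiter": 50, "verbose": -9})
while not es.stop():
    candidates = es.ask()                                   # sample low-dimensional candidates
    es.tell(candidates, [objective(z) for z in candidates]) # report black-box scores only
best_prompt = A @ np.asarray(es.result.xbest)               # final continuous prompt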
arXiv Detail & Related papers (2022-01-10T18:17:05Z) - Ranking and Tuning Pre-trained Models: A New Paradigm of Exploiting Model Hubs [136.4492678691406]
We propose a new paradigm of exploiting model hubs by ranking and tuning pre-trained models.
The best ranked PTM can be fine-tuned and deployed if we have no preference for the model's architecture.
The tuning part introduces a novel method for multiple PTMs tuning, which surpasses dedicated methods.
arXiv Detail & Related papers (2021-10-20T12:59:23Z) - Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews [2.66512000865131]
We study the accuracy and time efficiency of pre-trained neural language models (PTMs) for app review classification.
We set up different studies to evaluate PTMs in multiple settings.
In all cases, micro and macro Precision, Recall, and F1-scores are used.
arXiv Detail & Related papers (2021-04-12T23:23:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.