Towards a Classification of Open-Source ML Models and Datasets for Software Engineering
- URL: http://arxiv.org/abs/2411.09683v1
- Date: Thu, 14 Nov 2024 18:52:05 GMT
- Title: Towards a Classification of Open-Source ML Models and Datasets for Software Engineering
- Authors: Alexandra González, Xavier Franch, David Lo, Silverio Martínez-Fernández
- Abstract summary: Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks.
These resources lack a classification tailored to Software Engineering (SE) needs.
We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
- Score: 52.257764273141184
- Abstract: Background: Open-Source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time. Method: We conducted a repository mining study. We started with a systematically gathered database of PTMs and datasets from the HF API. Our selection was refined by analyzing model and dataset cards and metadata, such as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are replicable, with a publicly accessible replication package. Results: The most common SE task among PTMs and datasets is code generation, with a primary focus on software development and limited attention to software management. Popular PTMs and datasets mainly target software development. Among ML tasks, text generation is the most common in SE PTMs and datasets. There has been a marked increase in PTMs for SE since 2023 Q2. Conclusions: This study underscores the need for broader task coverage to enhance the integration of ML within SE practices.
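The Method paragraph describes filtering HF models by tags and card metadata before confirming SE relevance with Gemini 1.5 Pro. A minimal, illustrative sketch of the tag-filtering idea (not the authors' actual pipeline; the keyword map and function names below are hypothetical, and the LLM confirmation step is omitted):

```python
# Illustrative sketch of tag-based SE classification for Hugging Face models.
# SE_TASK_KEYWORDS is a hypothetical mapping; the paper's real taxonomy is in
# its replication package.
SE_TASK_KEYWORDS = {
    "code generation": ("code-generation", "text2code", "codegen"),
    "defect prediction": ("defect", "bug-detection"),
    "code summarization": ("code-summarization", "code2text"),
}

def classify_se_task(tags):
    """Return the first SE task whose keywords match any tag, else None."""
    lowered = [t.lower() for t in tags]
    for task, keywords in SE_TASK_KEYWORDS.items():
        if any(any(kw in tag for kw in keywords) for tag in lowered):
            return task
    return None

# Model tags could be fetched with the huggingface_hub library, e.g.:
#   from huggingface_hub import HfApi
#   for m in HfApi().list_models(limit=100):
#       print(m.id, classify_se_task(m.tags or []))

if __name__ == "__main__":
    print(classify_se_task(["pytorch", "text2code", "en"]))  # code generation
    print(classify_se_task(["image-classification"]))        # None
```

In the study itself, such coarse metadata matching is only a first pass; model and dataset cards are analyzed and relevance is confirmed with an LLM before classification.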
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and data is not two separate paths but rather interconnected.
On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data.
To promote the data-model co-development for MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
- FuseGen: PLM Fusion for Data-generation based Zero-shot Learning [18.51772808242954]
FuseGen is a novel data generation-based zero-shot learning framework.
It introduces a new criterion for subset selection from synthetic datasets.
The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
arXiv Detail & Related papers (2024-06-18T11:55:05Z)
- Automated categorization of pre-trained models for software engineering: A case study with a Hugging Face dataset [9.218130273952383]
Software engineering activities have been revolutionized by the advent of pre-trained models (PTMs).
The Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models.
This paper introduces an approach to enable the automatic classification of PTMs for SE tasks.
arXiv Detail & Related papers (2024-05-21T20:26:17Z)
- PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software [6.243303627949341]
This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs.
The dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use.
Our analysis provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation.
arXiv Detail & Related papers (2024-02-01T15:55:50Z)
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve our task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
- PeaTMOSS: Mining Pre-Trained Models in Open-Source Software [6.243303627949341]
We present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software.
PeaTMOSS has three parts: (1) a snapshot of 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them.
arXiv Detail & Related papers (2023-10-05T15:58:45Z)
- Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models [62.838689691468666]
We propose Federated Black-Box Prompt Tuning (Fed-BBPT) to optimally harness each local dataset.
Fed-BBPT capitalizes on a central server that aids local users in collaboratively training a prompt generator through regular aggregation.
Compared with extensive fine-tuning, Fed-BBPT sidesteps the memory challenges of storing and fine-tuning PTMs on local machines.
arXiv Detail & Related papers (2023-10-04T19:30:49Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews [2.66512000865131]
We study the accuracy and time efficiency of pre-trained neural language models (PTMs) for app review classification.
We set up different studies to evaluate PTMs in multiple settings.
In all cases, Micro and Macro Precision, Recall, and F1-scores are used.
arXiv Detail & Related papers (2021-04-12T23:23:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.