Ecosystem of Large Language Models for Code
- URL: http://arxiv.org/abs/2405.16746v2
- Date: Sun, 29 Sep 2024 06:30:27 GMT
- Title: Ecosystem of Large Language Models for Code
- Authors: Zhou Yang, Jieke Shi, Premkumar Devanbu, David Lo
- Abstract summary: This paper introduces a pioneering analysis of the code model ecosystem.
We first identify the popular and influential datasets, models, and contributors.
The top 3 most popular reuse types are fine-tuning, architecture sharing, and quantization.
- Score: 7.7454423388704745
- Abstract: The availability of vast amounts of publicly accessible source code and advances in modern language models, coupled with increasing computational resources, have led to a remarkable surge in the development of large language models for code (LLM4Code, for short). The interaction between code datasets and models gives rise to a complex ecosystem characterized by intricate dependencies that are worth studying. This paper introduces a pioneering analysis of the code model ecosystem. Utilizing Hugging Face -- the premier hub for transformer-based models -- as our primary source, we curate a list of datasets and models that are manually confirmed to be relevant to software engineering. By analyzing the ecosystem, we first identify the popular and influential datasets, models, and contributors. Popularity is quantified by various metrics, including the number of downloads, the number of likes, and the number of reuses. The ecosystem follows a power-law distribution, indicating that users prefer widely recognized models and datasets. Then, we manually categorize how models in the ecosystem are reused into nine categories, analyzing prevalent model reuse practices. The top 3 most popular reuse types are fine-tuning, architecture sharing, and quantization. We also explore the practices surrounding the publication of LLM4Code, specifically focusing on documentation practice and license selection. We find that documentation in the ecosystem contains less information than that in general artificial intelligence (AI)-related repositories hosted on GitHub. Additionally, license usage also differs from other software repositories: models in the ecosystem adopt some AI-specific licenses, e.g., RAIL (Responsible AI Licenses) and the AI Model License Agreement.
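As a rough illustration of the kind of analysis the abstract describes, the sketch below pulls popularity metrics for code-related models from the Hugging Face Hub and checks the power-law claim with a log-log fit. It is a minimal sketch, not the authors' pipeline: it assumes the `huggingface_hub` and `numpy` packages, and the keyword search stands in for the paper's manual relevance confirmation.

```python
# Minimal sketch (not the authors' pipeline): collect download counts for
# code-related models on the Hugging Face Hub and check the power-law claim.
# Assumes `pip install huggingface_hub numpy`; the keyword search is a
# stand-in for the paper's manual curation of software-engineering models.
import numpy as np
from huggingface_hub import HfApi

api = HfApi()
models = api.list_models(search="code", sort="downloads", direction=-1, limit=500)

downloads = np.array(
    sorted((m.downloads or 0 for m in models), reverse=True), dtype=float
)
downloads = downloads[downloads > 0]

# Under a power law, rank vs. count is roughly linear on log-log axes.
ranks = np.arange(1, len(downloads) + 1)
slope, _ = np.polyfit(np.log(ranks), np.log(downloads), 1)
print(f"log-log slope ~ {slope:.2f}; an approximately straight fit supports a power law")
```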
Related papers
- Cuvis.Ai: An Open-Source, Low-Code Software Ecosystem for Hyperspectral Processing and Classification [0.4038539043067986]
cuvis.ai is an open-source and low-code software ecosystem for data acquisition, preprocessing, and model training.
The package is written in Python and provides wrappers around common machine learning libraries.
arXiv Detail & Related papers (2024-11-18T06:33:40Z)
- Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities [89.40778301238642]
Model merging is an efficient technique for combining the capabilities of multiple models in the machine learning community.
There is a significant gap in the literature regarding a systematic and thorough review of these techniques.
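For flavor, here is the simplest merging recipe in this family, uniform weight interpolation between two checkpoints; this is a generic sketch (in the spirit of "model soups"), not the survey's own contribution, and it assumes two checkpoints with identical architectures.

```python
# Generic sketch of the simplest model-merging recipe: linear weight
# interpolation between two compatible checkpoints. Not this survey's method.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * A + (1 - alpha) * B, parameter by parameter."""
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Usage (paths are placeholders for two fine-tuned variants of one base model):
# sd_a = torch.load("finetuned_on_task_a.pt")
# sd_b = torch.load("finetuned_on_task_b.pt")
# model.load_state_dict(merge_state_dicts(sd_a, sd_b, alpha=0.5))
```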
arXiv Detail & Related papers (2024-08-14T16:58:48Z)
- EduNLP: Towards a Unified and Modularized Library for Educational Resources [78.8523961816045]
We present a unified, modularized, and extensive library, EduNLP, focusing on educational resource understanding.
In the library, we decouple the whole workflow into four key modules with consistent interfaces: data configuration, processing, model implementation, and model evaluation.
The current version provides 10 typical models from four categories, and 5 common downstream evaluation tasks on 8 subjects in the education domain.
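A hypothetical sketch of that four-module decoupling follows; the class and method names are invented for illustration and do not reflect EduNLP's actual interface.

```python
# Hypothetical illustration of a workflow decoupled into four modules sharing
# one interface. Names are invented and do NOT reflect EduNLP's API.
from typing import Any, Protocol

class Stage(Protocol):
    def run(self, payload: Any) -> Any: ...

class DataConfiguration:
    def run(self, payload):  # wrap the raw items with configuration metadata
        return {"items": payload}

class Processing:
    def run(self, payload):  # normalize text before modeling
        return [s.strip().lower() for s in payload["items"]]

class ModelImplementation:
    def run(self, payload):  # stand-in "model": score each item by length
        return [len(s) for s in payload]

class ModelEvaluation:
    def run(self, payload):  # aggregate scores into a single metric
        return sum(payload) / max(len(payload), 1)

pipeline = [DataConfiguration(), Processing(), ModelImplementation(), ModelEvaluation()]
data: Any = ["  What is 2 + 2?  ", "Solve x^2 = 9."]
for stage in pipeline:
    data = stage.run(data)
print(data)  # average score over the toy items
```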
arXiv Detail & Related papers (2024-06-03T12:45:40Z)
- Model Callers for Transforming Predictive and Generative AI Applications [2.7195102129095003]
We introduce a novel software abstraction termed "model caller".
Model callers act as an intermediary for AI and ML model calling.
We have released a prototype Python library for model callers, accessible for installation via pip or for download from GitHub.
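The abstraction is easy to picture: application code talks to an intermediary that owns pre-/post-processing and dispatch instead of calling the model directly. The sketch below is an invented illustration of that idea, not the released library's API.

```python
# Invented illustration of the "model caller" idea; NOT the paper's library.
# The caller mediates between application code and the underlying model.
from typing import Any, Callable

class ModelCaller:
    def __init__(
        self,
        model: Callable[[Any], Any],
        preprocess: Callable[[Any], Any] = lambda x: x,
        postprocess: Callable[[Any], Any] = lambda y: y,
    ):
        self.model = model
        self.preprocess = preprocess
        self.postprocess = postprocess

    def __call__(self, raw: Any) -> Any:
        # Intermediary step: normalize input, invoke the model, shape output.
        return self.postprocess(self.model(self.preprocess(raw)))

# Usage with a toy "model" that counts tokens:
caller = ModelCaller(model=lambda toks: len(toks), preprocess=str.split)
print(caller("an input sentence"))  # 3
```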
arXiv Detail & Related papers (2024-04-17T12:21:06Z)
- A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys) [57.30228361181045]
This survey connects key advancements in recommender systems using Generative Models (Gen-RecSys).
It covers: interaction-driven generative models; the use of large language models (LLM) and textual data for natural language recommendation; and the integration of multimodal models for generating and processing images/videos in RS.
Our work highlights necessary paradigms for evaluating the impact and harm of Gen-RecSys and identifies open challenges.
arXiv Detail & Related papers (2024-03-31T06:57:57Z)
- Breaking the Barrier: Utilizing Large Language Models for Industrial Recommendation Systems through an Inferential Knowledge Graph [19.201697767418597]
We propose a novel Large Language Model based Complementary Knowledge Enhanced Recommendation System (LLM-KERec).
It extracts unified concept terms from item and user information to capture user intent transitions and adapt to new items.
Extensive experiments conducted on three industry datasets demonstrate the significant performance improvement of our model compared to existing approaches.
arXiv Detail & Related papers (2024-02-21T12:22:01Z)
- Generative AI for Software Metadata: Overview of the Information Retrieval in Software Engineering Track at FIRE 2023 [18.616716369775883]
The Information Retrieval in Software Engineering (IRSE) track aims to develop solutions for automated evaluation of code comments.
The dataset consists of 9048 pairs of code comments and surrounding code snippets extracted from open-source C-based projects.
Labels generated by large language models increase bias in the prediction model but lead to less over-fitted results.
arXiv Detail & Related papers (2023-10-27T14:13:23Z)
- An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code [1.0742675209112622]
This study investigates whether publications that trained language models for software engineering tasks share source code and trained artifacts.
From 494 unique publications, we identified 293 relevant publications that use language models to address code-related tasks.
We find that there are deficiencies in the sharing of information and artifacts for current studies on source code models for software engineering tasks.
arXiv Detail & Related papers (2023-07-05T17:13:00Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
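As a generic taste of what such a library wraps, the sketch below loads a pretrained code LLM and completes a snippet using the widely used `transformers` API rather than CodeTF's own interface, which is not shown here; the checkpoint name is just one example of a publicly available code model.

```python
# Generic illustration with Hugging Face `transformers`, NOT CodeTF's own API:
# load a pretrained code LLM and complete a Python snippet.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Salesforce/codegen-350M-mono"  # example code-generation checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "def fibonacci(n):"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```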
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- A Survey of Knowledge Graph Reasoning on Graph Types: Static, Dynamic, and Multimodal [57.8455911689554]
Knowledge graph reasoning (KGR) aims to deduce new facts from existing facts based on mined logic rules underlying knowledge graphs (KGs).
It has been proven to significantly benefit the usage of KGs in many AI applications, such as question answering and recommendation systems.
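A toy example of rule-based deduction over triples makes the idea concrete; the entities, relations, and rule below are invented for illustration, and the survey covers far richer settings.

```python
# Toy rule-based KG reasoning: apply one mined Horn rule to known triples to
# deduce a new fact. Entities, relations, and the rule are invented examples.
facts = {
    ("alice", "works_at", "acme"),
    ("acme", "located_in", "paris"),
}

# Rule: (x, works_at, y) AND (y, located_in, z) => (x, based_in, z)
derived = {
    (x, "based_in", z)
    for (x, r1, y1) in facts if r1 == "works_at"
    for (y2, r2, z) in facts if r2 == "located_in" and y1 == y2
}
print(derived)  # {('alice', 'based_in', 'paris')}
```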
arXiv Detail & Related papers (2022-12-12T08:40:04Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.