An Exploratory Literature Study on Sharing and Energy Use of Language
Models for Source Code
- URL: http://arxiv.org/abs/2307.02443v1
- Date: Wed, 5 Jul 2023 17:13:00 GMT
- Title: An Exploratory Literature Study on Sharing and Energy Use of Language
Models for Source Code
- Authors: Max Hort and Anastasiia Grishina and Leon Moonen
- Abstract summary: This study investigates whether publications that trained language models for software engineering tasks share source code and trained artifacts.
From 494 unique publications, we identified 293 relevant publications that use language models to address code-related tasks.
We find that there are deficiencies in the sharing of information and artifacts for current studies on source code models for software engineering tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models trained on source code can support a variety of
software development tasks, such as code recommendation and program repair.
Training such models on large amounts of data benefits their performance.
However, the size of the data and models results in long training times and
high energy consumption. While publishing source code allows for replicability,
users need to repeat the expensive training process if models are not shared.
The main goal of the study is to investigate whether publications that trained
language models for software engineering (SE) tasks share source code and
trained artifacts. The second goal is to analyze the transparency of training
energy usage. We perform a snowballing-based literature search to find
publications on language models for source code, and analyze their reusability
from a sustainability standpoint.
From 494 unique publications, we identified 293 relevant publications that
use language models to address code-related tasks. Among them, 27% (79 out of
293) make artifacts available for reuse. This can be in the form of tools or
IDE plugins designed for specific tasks or task-agnostic models that can be
fine-tuned for a variety of downstream tasks. Moreover, we collect insights on
the hardware used for model training, as well as training time, which together
determine the energy consumption of the development process. We find that there
are deficiencies in the sharing of information and artifacts for current
studies on source code models for software engineering tasks, with 40% of the
surveyed papers not sharing source code or trained artifacts. We recommend the
sharing of source code as well as trained artifacts, to enable sustainable
reproducibility. Moreover, comprehensive information on training times and
hardware configurations should be shared for transparency on a model's carbon
footprint.
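To illustrate why reported training time and hardware configuration matter, the back-of-the-envelope estimate below follows the common power-times-time approach to energy and carbon accounting. All numeric values (GPU count, power draw, training time, PUE, grid carbon intensity) are hypothetical placeholders, not figures from the surveyed papers.

```python
# Rough training-energy and carbon-footprint estimate from reported
# hardware and training time. All values below are hypothetical.
gpu_count = 8            # accelerators used for training
gpu_power_kw = 0.3       # average power draw per GPU (~300 W)
training_hours = 120.0   # reported wall-clock training time
pue = 1.5                # data-center power usage effectiveness
grid_intensity = 0.4     # kg CO2-eq emitted per kWh consumed

energy_kwh = gpu_count * gpu_power_kw * training_hours * pue
co2_kg = energy_kwh * grid_intensity
print(f"Estimated energy use: {energy_kwh:.0f} kWh")
print(f"Estimated footprint:  {co2_kg:.0f} kg CO2-eq")
```

With papers reporting the inputs above, readers can reproduce such an estimate instead of guessing; without them, a model's footprint is unrecoverable.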
Related papers
- In-Context Code-Text Learning for Bimodal Software Engineering (arXiv, 2024-10-08)
Bimodal software analysis initially appeared to be within reach with the advent of large language models.
We postulate that in-context learning for the code-text bimodality is a promising avenue.
We consider a diverse dataset encompassing 23 software engineering tasks, which we transform into an in-context learning format.
- DeepCodeProbe: Towards Understanding What Models Trained on Code Learn (arXiv, 2024-07-11)
We introduce DeepCodeProbe, a probing approach that examines the syntax and representation learning abilities of ML models.
Our study applies DeepCodeProbe to state-of-the-art models for code clone detection, code summarization, and comment generation.
Findings reveal that while small models capture abstract syntactic representations, their ability to fully grasp programming language syntax is limited.
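As a generic illustration of the probing methodology described above (not DeepCodeProbe's actual implementation), the sketch below trains a linear classifier on frozen embeddings to test whether a syntactic property is linearly decodable; the embeddings here are random stand-ins for real code-model representations.

```python
# Minimal probing-classifier sketch: a linear model trained on frozen
# embeddings tests whether a property is encoded in the representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # stand-in for code embeddings
labels = rng.integers(0, 2, size=1000)     # e.g. "contains a loop" yes/no

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# High accuracy would suggest the property is present in the embeddings.
print(f"Probe accuracy: {probe.score(X_test, y_test):.2f}")
```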
- EduNLP: Towards a Unified and Modularized Library for Educational Resources (arXiv, 2024-06-03)
We present a unified, modularized, and extensive library, EduNLP, focusing on educational resource understanding.
In the library, we decouple the whole workflow into four key modules with consistent interfaces: data configuration, processing, model implementation, and model evaluation.
The current version provides 10 typical models from four categories, along with 5 common downstream evaluation tasks in the education domain covering 8 subjects.
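A four-module decomposition with consistent interfaces could look roughly like the sketch below; the names and signatures are invented for illustration and are not EduNLP's actual API.

```python
# Hypothetical sketch of a four-module workflow with consistent interfaces;
# all names are illustrative, not EduNLP's real API.
from dataclasses import dataclass

@dataclass
class DataConfig:                                 # module 1: data configuration
    dataset_path: str
    subject: str

def process(config: DataConfig) -> list[str]:     # module 2: processing
    return [f"item from {config.dataset_path} ({config.subject})"]

def run_model(items: list[str]) -> list[float]:   # module 3: model implementation
    return [float(len(item)) for item in items]

def evaluate(scores: list[float]) -> float:       # module 4: model evaluation
    return sum(scores) / len(scores)

config = DataConfig("math_questions.json", subject="math")
print(evaluate(run_model(process(config))))
```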
- INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers (arXiv, 2023-12-08)
We use a framework to define 15 probing tasks that exercise surface, syntactic, structural and semantic characteristics of source code.
We probe 8 pre-trained source code models, as well as a natural language model (BERT) as our baseline.
We find that models that incorporate some structural information (such as GraphCodeBERT) have a better representation of source code characteristics.
- NLLB-CLIP -- train performant multilingual image retrieval model on a budget (arXiv, 2023-09-04)
We present NLLB-CLIP, a CLIP model with a text encoder from the NLLB model.
We used an automatically created dataset of 106,246 good-quality images with captions in 201 languages.
We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
- SoTaNa: The Open-Source Software Development Assistant (arXiv, 2023-08-25)
SoTaNa is an open-source software development assistant.
It generates high-quality instruction-based data for the domain of software engineering.
It employs a parameter-efficient fine-tuning approach to enhance the open-source foundation model, LLaMA.
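A minimal sketch of parameter-efficient fine-tuning with LoRA adapters via the Hugging Face peft library follows; the base model name and hyperparameters are placeholders, and this is not SoTaNa's published training code.

```python
# Parameter-efficient fine-tuning sketch using LoRA adapters (peft).
# Model name and hyperparameters are placeholders, not SoTaNa's setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
# Only the small adapter matrices are trainable; the base model stays frozen.
model.print_trainable_parameters()
```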
- Active Code Learning: Benchmarking Sample-Efficient Training of Code Models (arXiv, 2023-06-02)
In machine learning for software engineering (ML4Code), efficiently training models of code with less human effort has become an emergent problem.
Active learning is a technique that allows developers to train a model with reduced labeled data while still reaching the desired performance.
This paper builds the first benchmark to study this critical problem: active code learning.
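A generic uncertainty-sampling loop, one common active-learning strategy (not necessarily the benchmark's exact protocol), can be sketched on synthetic data as follows.

```python
# Generic active-learning loop with uncertainty sampling; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=2000) > 0).astype(int)

labeled = list(range(20))                  # small initial labeled pool
unlabeled = [i for i in range(2000) if i not in labeled]
for _ in range(5):                         # five labeling rounds
    model = LogisticRegression().fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = -np.abs(probs - 0.5)     # closest to 0.5 = most uncertain
    picks = np.argsort(uncertainty)[-10:]  # query 10 most uncertain samples
    labeled += [unlabeled[i] for i in picks]
    unlabeled = [i for i in unlabeled if i not in set(labeled)]
print(f"Accuracy with {len(labeled)} labels: {model.score(X, y):.2f}")
```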
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages (arXiv, 2023-05-03)
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, we explore how a mixture distribution and multi-epoch training of programming and natural languages affect model performance.
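The core idea of a prefix-LM (bidirectional attention over the prefix, causal attention over the continuation) can be shown with a small mask construction; this is a generic sketch, not the CodeGen2 training code.

```python
# Build a prefix-LM attention mask: positions inside the prefix attend to
# the whole prefix bidirectionally; later positions attend causally.
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # full attention inside prefix
    return mask  # mask[i, j] == True means position i may attend to j

print(prefix_lm_mask(seq_len=6, prefix_len=3).astype(int))
```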
- Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning (arXiv, 2022-02-22)
In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models.
Deep learning in resource-limited domains still faces multiple challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning.
Model reprogramming enables resource-efficient cross-domain machine learning by repurposing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning.
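The mechanics, a trainable input transformation plus an output-label mapping wrapped around a frozen source model, can be sketched in PyTorch; the "frozen model" below is a stand-in linear layer, not a real pre-trained network or any specific paper's code.

```python
# Model reprogramming sketch: a frozen source model is reused on a target
# task by learning only an input perturbation and an output-label mapping.
import torch
import torch.nn as nn

frozen_model = nn.Linear(64, 10)       # stand-in pre-trained source model
for p in frozen_model.parameters():
    p.requires_grad = False            # source model stays untouched

delta = nn.Parameter(torch.zeros(64))  # trainable input perturbation
label_map = nn.Linear(10, 3)           # maps 10 source classes to 3 target classes

x = torch.randn(32, 64)                # batch of target-domain inputs
target_logits = label_map(frozen_model(x + delta))
loss = nn.functional.cross_entropy(target_logits, torch.randint(0, 3, (32,)))
loss.backward()                        # gradients reach only delta and label_map
```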
- AstBERT: Enabling Language Model for Code Understanding with Abstract Syntax Tree (arXiv, 2022-01-20)
We propose AstBERT, a pre-trained language model that aims to better understand programming languages (PL) using the abstract syntax tree (AST).
Specifically, we collect a large amount of source code (both Java and Python) from GitHub, from which information in the source code can be interpreted and integrated.
Experimental results show that our AstBERT model achieves state-of-the-art performance on the downstream tasks evaluated.
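How structural information is extracted from source code is easy to see with Python's built-in ast module; this is a generic illustration of AST extraction, not AstBERT's pipeline.

```python
# Extract abstract-syntax-tree structure from a code snippet; a generic
# illustration of the kind of information AST-based models consume.
import ast

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)
for node in ast.walk(tree):
    print(type(node).__name__)  # Module, FunctionDef, arguments, Return, ...
```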
- KILT: a Benchmark for Knowledge Intensive Language Tasks (arXiv, 2020-09-04)
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
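The dense-index baseline rests on simple dense retrieval: embed the query and passages, then rank by inner product. Below is a toy sketch with random stand-in embeddings, not the actual KILT baseline code.

```python
# Toy dense retrieval: rank passages by inner product with the query
# embedding. Embeddings are random stand-ins; real systems use trained
# encoders, and the top passages would feed a seq2seq reader.
import numpy as np

rng = np.random.default_rng(0)
doc_index = rng.normal(size=(10000, 128))  # dense vector index of passages
query = rng.normal(size=128)               # encoded query

scores = doc_index @ query                 # inner-product similarity
top_k = np.argsort(scores)[::-1][:5]       # five best-matching passages
print("Top passages:", top_k, "scores:", scores[top_k].round(2))
```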