Assessing Project-Level Fine-Tuning of ML4SE Models
- URL: http://arxiv.org/abs/2206.03333v1
- Date: Tue, 7 Jun 2022 14:19:33 GMT
- Authors: Egor Bogomolov and Sergey Zhuravlev and Egor Spirin and Timofey Bryksin
- Abstract summary: We investigate how the model's quality can be improved if we target a specific project.
We evaluate three models of different complexity and compare their quality in three settings.
Per-project fine-tuning can greatly improve the models' quality as they capture the project's domain and naming conventions.
- Score: 3.699097874146491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning for Software Engineering (ML4SE) is an actively growing
research area that focuses on methods that help programmers in their work. To
apply the developed methods in practice, they must reach a quality that helps
rather than distracts developers. While the
development of new approaches to code representation and data collection
improves the overall quality of the models, it does not take into account the
information that we can get from the project at hand.
In this work, we investigate how the model's quality can be improved if we
target a specific project. We develop a framework to assess quality
improvements that models can get after fine-tuning for the method name
prediction task on a particular project. We evaluate three models of different
complexity and compare their quality in three settings: trained on a large
dataset of Java projects, further fine-tuned on the data from a particular
project, and trained from scratch on this data. We show that per-project
fine-tuning can greatly improve the models' quality as they capture the
project's domain and naming conventions. We open-source the tool we used for
data collection, as well as the code to run the experiments:
https://zenodo.org/record/6040745.
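The three training settings the abstract compares can be illustrated with a deliberately toy sketch. The `SubtokenModel`, its counting-based "training", and the example method names below are all hypothetical stand-ins, not the paper's actual models; the point is only to show how per-project fine-tuning lets a generally trained model pick up a project's own naming conventions.

```python
from collections import Counter


class SubtokenModel:
    """Toy unigram model over method-name subtokens.

    A stand-in for a real ML4SE model: "training" just counts
    subtokens, and "fine-tuning" continues counting on new data.
    """

    def __init__(self):
        self.counts = Counter()

    def train(self, names):
        # Base training: count subtokens of snake_case method names.
        for name in names:
            self.counts.update(name.split("_"))
        return self

    def fine_tune(self, names, weight=5):
        # Up-weight project-specific data so it can override
        # the general corpus statistics.
        for name in names:
            for tok in name.split("_"):
                self.counts[tok] += weight
        return self

    def predict_subtoken(self):
        # "Predict" the single most likely subtoken.
        return self.counts.most_common(1)[0][0]


# A general corpus vs. one project with its own naming convention.
corpus = ["get_value", "get_name", "set_value", "to_string"]
project = ["fetch_record", "fetch_index", "fetch_schema"]

pretrained = SubtokenModel().train(corpus)                       # setting 1
fine_tuned = SubtokenModel().train(corpus).fine_tune(project)    # setting 2
from_scratch = SubtokenModel().train(project)                    # setting 3

print(pretrained.predict_subtoken())    # reflects corpus conventions
print(fine_tuned.predict_subtoken())    # shifts toward project vocabulary
print(from_scratch.predict_subtoken())  # knows only the project
```

In this caricature, the fine-tuned model agrees with the from-scratch one on the project's dominant vocabulary while still retaining the corpus statistics, which mirrors the paper's finding that fine-tuning captures project-specific naming conventions without discarding general knowledge.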
Related papers
- Leveraging Large Language Models for Software Model Completion: Results from Industrial and Public Datasets [6.585732390922304]
We evaluate the potential of large language models for model completion with retrieval-augmented generation.
We found that large language models are indeed a promising technology for supporting software model evolution.
arXiv Detail & Related papers (2024-06-25T15:43:20Z)
- Generative Design through Quality-Diversity Data Synthesis and Language Models [5.196236145367301]
Two fundamental challenges face generative models in engineering applications: the acquisition of high-performing, diverse datasets, and the adherence to precise constraints in generated designs.
We propose a novel approach combining optimization, constraint satisfaction, and language models to tackle these challenges in architectural design.
arXiv Detail & Related papers (2024-05-16T11:30:08Z)
- What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks.
We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Optimizing the AI Development Process by Providing the Best Support Environment [0.756282840161499]
The main stages of machine learning are problem understanding, data management, model building, and model deployment and maintenance.
The framework was built using python language to perform data augmentation using deep learning advancements.
arXiv Detail & Related papers (2023-04-29T00:44:50Z)
- Learning code summarization from a small and local dataset [0.0]
Training on project-specific data, and testing on the same project, is a promising idea.
We compare several models and training approaches, including same-project training, cross-project training, and training a model specially designed to be sample-efficient.
We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art.
arXiv Detail & Related papers (2022-06-02T00:16:03Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning [65.268245109828]
In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models.
Deep learning in resource-limited domains still faces multiple challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning.
Model reprogramming enables resource-efficient cross-domain machine learning by repurposing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning.
arXiv Detail & Related papers (2022-02-22T02:33:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.