Assessing Project-Level Fine-Tuning of ML4SE Models
- URL: http://arxiv.org/abs/2206.03333v1
- Date: Tue, 7 Jun 2022 14:19:33 GMT
- Title: Assessing Project-Level Fine-Tuning of ML4SE Models
- Authors: Egor Bogomolov and Sergey Zhuravlev and Egor Spirin and Timofey
Bryksin
- Abstract summary: We investigate how the model's quality can be improved if we target a specific project.
We evaluate three models of different complexity and compare their quality in three settings.
Per-project fine-tuning can greatly improve the models' quality as they capture the project's domain and naming conventions.
- Score: 3.699097874146491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning for Software Engineering (ML4SE) is an actively growing
research area that focuses on methods that help programmers in their work. To
be applied in practice, the developed methods need to achieve reasonable
quality so that they help rather than distract developers. While the
development of new approaches to code representation and data collection
improves the overall quality of the models, it does not take into account the
information that we can get from the project at hand.
In this work, we investigate how the model's quality can be improved if we
target a specific project. We develop a framework to assess quality
improvements that models can get after fine-tuning for the method name
prediction task on a particular project. We evaluate three models of different
complexity and compare their quality in three settings: trained on a large
dataset of Java projects, further fine-tuned on the data from a particular
project, and trained from scratch on this data. We show that per-project
fine-tuning can greatly improve the models' quality as they capture the
project's domain and naming conventions. We open-source the tool we used for
data collection, as well as the code to run the experiments:
https://zenodo.org/record/6040745.
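To make the experimental setup concrete, below is a minimal sketch of the three settings compared in the paper: a model trained on a large multi-project corpus, the same model fine-tuned on a single project, and a model trained from scratch on that project only, all evaluated on the project's held-out methods. The model class, helper names, and metric implementation are illustrative placeholders, not the authors' actual code (their tool and experiment scripts are at the Zenodo link above).

```python
# Hypothetical sketch of the paper's three evaluation settings for
# method name prediction. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple

# A training example: (method body tokens, ground-truth name sub-tokens).
Example = Tuple[List[str], List[str]]

@dataclass
class MethodNameModel:
    """Placeholder for a real ML4SE model (e.g. a code2seq-style or Transformer model)."""
    name: str

    def train(self, data: List[Example], epochs: int = 10) -> "MethodNameModel":
        # A real implementation would run gradient descent over `data` here.
        return self

    def fine_tune(self, data: List[Example], epochs: int = 3) -> "MethodNameModel":
        # Fine-tuning = continued training on project-specific data.
        return self

    def predict(self, body: List[str]) -> List[str]:
        return ["placeholder"]

def subtoken_f1(model: MethodNameModel, test: List[Example]) -> float:
    """Sub-token F1, a common metric for method name prediction."""
    tp = fp = fn = 0
    for body, target in test:
        pred = model.predict(body)
        tp += len(set(pred) & set(target))
        fp += len(set(pred) - set(target))
        fn += len(set(target) - set(pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def compare_settings(large_corpus: List[Example],
                     project_train: List[Example],
                     project_test: List[Example]) -> dict:
    base = MethodNameModel("base").train(large_corpus)
    fine_tuned = MethodNameModel("fine-tuned").train(large_corpus).fine_tune(project_train)
    from_scratch = MethodNameModel("from-scratch").train(project_train)
    return {m.name: subtoken_f1(m, project_test)
            for m in (base, fine_tuned, from_scratch)}
```

Calling compare_settings with a held-out split of the target project mirrors the shape of the comparison in the paper: base vs. fine-tuned vs. from-scratch quality on the same project-specific test set.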
Related papers
- Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists [41.94295877935867]
We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science.
We demonstrate that our proposed benchmark, FeatEng, can cheaply and efficiently assess the broad capabilities of LLMs.
arXiv Detail & Related papers (2024-10-30T17:59:01Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.
Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data to improve long-context capabilities.
We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.
Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data.
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
- Leveraging Large Language Models for Software Model Completion: Results from Industrial and Public Datasets [6.585732390922304]
We evaluate the potential of large language models for model completion with retrieval-augmented generation.
We found that large language models are indeed a promising technology for supporting software model evolution.
arXiv Detail & Related papers (2024-06-25T15:43:20Z)
- Generative Design through Quality-Diversity Data Synthesis and Language Models [5.196236145367301]
Two fundamental challenges face generative models in engineering applications: the acquisition of high-performing, diverse datasets, and the adherence to precise constraints in generated designs.
We propose a novel approach combining optimization, constraint satisfaction, and language models to tackle these challenges in architectural design.
arXiv Detail & Related papers (2024-05-16T11:30:08Z)
- What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks.
We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Optimizing the AI Development Process by Providing the Best Support Environment [0.756282840161499]
The main stages of machine learning are problem understanding, data management, model building, model deployment, and maintenance.
The framework was built in Python to perform data augmentation using deep learning advancements.
arXiv Detail & Related papers (2023-04-29T00:44:50Z)
- Learning code summarization from a small and local dataset [0.0]
Training on project-specific data, and testing on the same project, is a promising idea.
We compare several models and training approaches, including same-project training, cross-project training, and training a model specifically designed to be sample-efficient.
We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art.
arXiv Detail & Related papers (2022-06-02T00:16:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.