Evaluating Transfer Learning for Simplifying GitHub READMEs
- URL: http://arxiv.org/abs/2308.09940v1
- Date: Sat, 19 Aug 2023 08:20:41 GMT
- Title: Evaluating Transfer Learning for Simplifying GitHub READMEs
- Authors: Haoyu Gao, Christoph Treude and Mansooreh Zahedi
- Abstract summary: This study explores the potential of text simplification techniques in the domain of software engineering to automatically simplify GitHub README files.
We collected software-related pairs of GitHub README files consisting of 14,588 entries, aligned difficult sentences with their simplified counterparts, and trained a Transformer-based model to automatically simplify difficult versions.
Using automated BLEU scores and human evaluation, we compared the performance of different transfer learning schemes and the baseline models without transfer learning.
- Score: 11.219774223416648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software documentation captures detailed knowledge about a software product,
e.g., code, technologies, and design. It plays an important role in the
coordination of development teams and in conveying ideas to various
stakeholders. However, software documentation can be hard to comprehend if it
is written with jargon and complicated sentence structure. In this study, we
explored the potential of text simplification techniques in the domain of
software engineering to automatically simplify GitHub README files. We
collected software-related pairs of GitHub README files consisting of 14,588
entries, aligned difficult sentences with their simplified counterparts, and
trained a Transformer-based model to automatically simplify difficult versions.
To mitigate the sparse and noisy nature of the software-related simplification
dataset, we applied general text simplification knowledge to this field. Since
many general-domain difficult-to-simple Wikipedia document pairs are already
publicly available, we explored the potential of transfer learning by first
training the model on the Wikipedia data and then fine-tuning it on the README
data. Using automated BLEU scores and human evaluation, we compared the
performance of different transfer learning schemes and the baseline models
without transfer learning. The transfer learning model using the best
checkpoint trained on a general topic corpus achieved the best performance of
34.68 BLEU score and statistically significantly higher human annotation scores
compared to the rest of the schemes and baselines. We conclude that transfer
learning is a promising direction for circumventing the data scarcity and
style drift problems in software README simplification, achieving a better
trade-off between simplification and preservation of meaning.
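
The two-stage scheme described above (general-domain pre-training on Wikipedia simplification pairs, then fine-tuning on README pairs, scored with BLEU) can be sketched as follows; the t5-small checkpoint, hyperparameters, and the sacrebleu scorer are illustrative assumptions, not the authors' released setup.

```python
# A minimal sketch of the two-stage transfer-learning pipeline described
# above; model choice, hyperparameters, and toy data are assumptions.
import torch
import sacrebleu
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train(pairs, epochs=1):
    """Teacher-forced seq2seq training on (difficult, simple) sentence pairs."""
    model.train()
    for _ in range(epochs):
        for difficult, simple in pairs:
            batch = tokenizer(difficult, return_tensors="pt", truncation=True)
            labels = tokenizer(simple, return_tensors="pt", truncation=True).input_ids
            loss = model(**batch, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Toy stand-ins for the real corpora.
wiki_pairs = [("The feline reposed upon the rug.", "The cat sat on the mat.")]
readme_pairs = [("Instantiate the dependency graph prior to invocation.",
                 "Build the dependency graph before running it.")]

train(wiki_pairs)    # stage 1: general-domain Wikipedia pre-training
train(readme_pairs)  # stage 2: fine-tuning on the README corpus

# BLEU against a reference simplification, as in the automated evaluation.
ids = model.generate(**tokenizer("Instantiate the dependency graph prior to invocation.",
                                 return_tensors="pt"), max_new_tokens=32)
hyp = tokenizer.decode(ids[0], skip_special_tokens=True)
print(sacrebleu.corpus_bleu([hyp], [["Build the dependency graph before running it."]]).score)
```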
Related papers
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning [89.89857766491475]
We propose a complex reasoning schema over KGs built upon large language models (LLMs).
We augment the arbitrary first-order logical queries via binary tree decomposition to stimulate the reasoning capability of LLMs.
Experiments across widely used datasets demonstrate that LACT brings substantial improvements (an average +5.5% MRR score) over advanced methods.
arXiv Detail & Related papers (2024-05-02T18:12:08Z)
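
The binary tree decomposition LACT applies to first-order queries can be pictured with a toy sketch; the query grammar and the left-leaning fold below are invented simplifications, not the paper's actual procedure.

```python
# Toy illustration: fold a flat conjunctive query into a binary tree so
# each node carries a single logical operator. Not LACT's actual code.
from dataclasses import dataclass

@dataclass
class Node:
    op: str                 # "AND" or "ATOM"
    left: "Node" = None
    right: "Node" = None
    atom: str = ""

def decompose(conjuncts):
    """Fold a list of atoms into a left-leaning binary AND-tree."""
    tree = Node("ATOM", atom=conjuncts[0])
    for a in conjuncts[1:]:
        tree = Node("AND", left=tree, right=Node("ATOM", atom=a))
    return tree

# e.g. wrote(X,P) AND cites(P,Y) AND topic(Y,KG)
q = decompose(["wrote(X,P)", "cites(P,Y)", "topic(Y,KG)"])
```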
- TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z)
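
The "contrastive learning manner" in the entry above typically pairs each snippet with a transformed view of itself; a generic InfoNCE-style loss over such pairs is sketched below as an assumption, not TransformCode's actual objective.

```python
# Generic InfoNCE-style contrastive loss over code-snippet embeddings.
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """anchors[i] and positives[i] embed a snippet and its transformed view."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # all-pairs similarity
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # matching pairs are positives

loss = info_nce(np.random.randn(8, 64), np.random.randn(8, 64))
```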
- Software Entity Recognition with Noise-Robust Learning [31.259250137320468]
We leverage the Wikipedia taxonomy to develop a comprehensive entity lexicon with 79K unique software entities in 12 fine-grained types.
We then propose self-regularization, a noise-robust learning approach, to the training of our software entity recognition model by accounting for many dropouts.
Results show that models trained with self-regularization outperform both their vanilla counterparts and state-of-the-art approaches on our Wikipedia benchmark and two Stack Overflow benchmarks.
arXiv Detail & Related papers (2023-08-21T08:41:46Z)
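
One plausible reading of "accounting for many dropouts" is a consistency loss between predictions from independent dropout passes; the sketch below follows that reading as an assumption, not the paper's verified formulation.

```python
# Assumed reading: regularize by pulling two stochastic dropout
# predictions toward agreement with a symmetric KL term.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dropout_pass(x, w, rng, rate=0.3):
    mask = (rng.random(x.shape) > rate) / (1 - rate)   # inverted dropout
    return softmax((x * mask) @ w)

rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 16)), rng.standard_normal((16, 5))
p, q = dropout_pass(x, w, rng), dropout_pass(x, w, rng)
kl = 0.5 * np.sum(p * np.log(p / q) + q * np.log(q / p), axis=-1).mean()
```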
- TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills [31.75121546422898]
We present TransCoder, a unified Transferable fine-tuning strategy for Code representation learning.
We employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge.
Our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement.
arXiv Detail & Related papers (2023-05-23T06:59:22Z)
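
A tunable prefix encoder can be pictured as a small matrix of learned vectors prepended to token embeddings before a frozen encoder; the sizes and stub below are assumptions, not TransCoder's actual architecture.

```python
# Prefix-tuning sketch: only the prefix matrix would be trained.
import numpy as np

prefix_len, d_model = 8, 32
prefix = np.random.randn(prefix_len, d_model) * 0.02   # the tunable part

def encode_with_prefix(token_embeddings):
    """Prepend the shared prefix so transferable knowledge rides along."""
    return np.concatenate([prefix, token_embeddings], axis=0)

tokens = np.random.randn(20, d_model)   # embeddings from a frozen code encoder
h = encode_with_prefix(tokens)          # shape (28, 32), fed to the encoder stack
```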
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
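
The concatenation the LETI summary describes can be sketched as training-example construction; the field labels below are illustrative, not the paper's exact format.

```python
# Build one LETI-style training sequence; a causal LM would then be
# fine-tuned on this text with the ordinary LM objective.
def build_leti_example(instruction, program, feedback):
    return (
        f"Instruction: {instruction}\n"
        f"Program:\n{program}\n"
        f"Feedback: {feedback}"
    )

ex = build_leti_example(
    "Return the factorial of n.",
    "def f(n):\n    return 1 if n <= 1 else n * f(n - 1)",
    "Correct, but consider handling negative n.",
)
```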
- Towards Building the Federated GPT: Federated Instruction Tuning [66.7900343035733]
This paper introduces Federated Instruction Tuning (FedIT) as the learning framework for the instruction tuning of large language models (LLMs).
We demonstrate that by exploiting the heterogeneous and diverse sets of instructions on clients' ends, FedIT improves the performance of LLMs compared to centralized training with only limited local instructions.
arXiv Detail & Related papers (2023-05-09T17:42:34Z)
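
The aggregation step underlying federated instruction tuning is typically FedAvg-style weight averaging; the size-weighted average below is that standard choice, assumed here for illustration.

```python
# FedAvg sketch: average per-client fine-tuned weights, weighted by
# each client's local instruction count.
def fed_avg(client_weights, client_sizes):
    """client_weights: list of {param_name: list-of-floats} dicts."""
    total = sum(client_sizes)
    avg = {}
    for name in client_weights[0]:
        avg[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(len(client_weights[0][name]))
        ]
    return avg

global_update = fed_avg(
    [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}],   # two clients' tuned params
    [100, 300],                                # their local instruction counts
)
```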
- Novel transfer learning schemes based on Siamese networks and synthetic data [6.883906273999368]
Transfer learning schemes based on deep networks offer state-of-the-art technologies in computer vision.
Such applications are currently restricted to domains where suitable deep network models are readily available.
We propose a novel transfer learning scheme which expands a recently introduced Twin-VAE architecture.
arXiv Detail & Related papers (2022-11-21T09:48:21Z)
- ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
A meta-learner adapts to give feedback to student code on a new programming question from just a few examples by instructors.
Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university.
arXiv Detail & Related papers (2021-07-23T22:41:28Z)
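
The few-shot classification core of a prototypical approach can be sketched with class prototypes: the mean embedding of each feedback class's few instructor-labelled examples, with new student code assigned to the nearest prototype. The embeddings below are random stand-ins.

```python
# Prototypical-network-style few-shot classification sketch.
import numpy as np

def prototypes(support, labels):
    """Mean embedding per feedback class."""
    return {c: support[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(query, protos):
    """Assign the query to the nearest class prototype."""
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

rng = np.random.default_rng(1)
support = rng.standard_normal((6, 16))      # 3 examples x 2 feedback classes
labels = np.array([0, 0, 0, 1, 1, 1])
print(classify(rng.standard_normal(16), prototypes(support, labels)))
```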
- EasyTransfer -- A Simple and Scalable Deep Transfer Learning Platform for NLP Applications [65.87067607849757]
EasyTransfer is a platform to develop deep Transfer Learning algorithms for Natural Language Processing (NLP) applications.
EasyTransfer supports various NLP models in the ModelZoo, including mainstream PLMs and multi-modality models.
EasyTransfer is currently deployed at Alibaba to support a variety of business scenarios.
arXiv Detail & Related papers (2020-11-18T18:41:27Z)