LARCH: Large Language Model-based Automatic Readme Creation with
Heuristics
- URL: http://arxiv.org/abs/2308.03099v2
- Date: Tue, 22 Aug 2023 09:48:20 GMT
- Title: LARCH: Large Language Model-based Automatic Readme Creation with
Heuristics
- Authors: Yuta Koreeda, Terufumi Morishita, Osamu Imaichi, Yasuhiro Sogawa
- Abstract summary: We show that large language models (LLMs) are capable of generating a coherent and factually correct readme if we can identify a code fragment that is representative of the repository.
We develop LARCH (LLM-based Automatic Readme Creation with Heuristics), which leverages representative code identification with heuristics and weak supervision.
We demonstrate that LARCH can generate coherent and factually correct readmes in the majority of cases, outperforming a baseline that does not rely on representative code identification.
- Score: 9.820370420194948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Writing a readme is a crucial aspect of software development as it plays a
vital role in managing and reusing program code. Though it is a pain point for
many developers, automatically creating one remains a challenge even with the
recent advancements in large language models (LLMs), because it requires
generating an abstract description from thousands of lines of code. In this
demo paper, we show that LLMs are capable of generating a coherent and
factually correct readme if we can identify a code fragment that is
representative of the repository. Building upon this finding, we developed
LARCH (LLM-based Automatic Readme Creation with Heuristics) which leverages
representative code identification with heuristics and weak supervision.
Through human and automated evaluations, we illustrate that LARCH can generate
coherent and factually correct readmes in the majority of cases, outperforming
a baseline that does not rely on representative code identification. We have
made LARCH open-source and provided a cross-platform Visual Studio Code
interface and command-line interface, accessible at
https://github.com/hitachi-nlp/larch. A demo video showcasing LARCH's
capabilities is available at https://youtu.be/ZUKkh5ED-O4.
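As a rough, hypothetical sketch of the core idea (the file-name hints, size cutoff, and prompt wording below are invented for illustration and are not LARCH's actual heuristics), representative code identification can be approximated by scoring repository files and handing the best fragment to an LLM:

```python
# Hypothetical sketch of LARCH's core idea: pick a "representative" code
# fragment with simple heuristics, then ask an LLM to draft the readme.
# The heuristics and prompt are illustrative, not LARCH's actual ones.
import os

CANDIDATE_HINTS = ("main", "cli", "app", "__main__", "train", "run")

def score_file(path: str) -> int:
    """Crude heuristic: entry-point-like names and moderate size score higher."""
    name = os.path.basename(path).lower()
    score = sum(hint in name for hint in CANDIDATE_HINTS)
    if 500 < os.path.getsize(path) < 20_000:  # skip stubs and generated files
        score += 1
    return score

def representative_fragment(repo_dir: str, max_chars: int = 4000) -> str:
    files = [os.path.join(root, f)
             for root, _, names in os.walk(repo_dir)
             for f in names if f.endswith(".py")]
    best = max(files, key=score_file)
    with open(best, encoding="utf-8", errors="ignore") as fh:
        return fh.read()[:max_chars]

def readme_prompt(fragment: str) -> str:
    return ("Write a README (overview, installation, usage) for the "
            "repository containing this representative code:\n\n" + fragment)

# The resulting prompt would be sent to any chat-style LLM API.
```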
Related papers
- CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings [32.72039589832989]
Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency.
These advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards.
We propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains.
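For intuition only, such a detector can be framed as binary sequence classification over source code. The sketch below wires an off-the-shelf CodeBERT encoder to a fresh two-way head; it is not the CoDet-M4 pipeline, and the head would need training before its scores mean anything:

```python
# Generic human-vs-LLM code classifier skeleton (not CoDet-M4 itself).
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = human, 1 = LLM-written

snippet = "def add(a, b):\n    return a + b\n"
inputs = tok(snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)
print(f"P(LLM-written) = {probs[0, 1]:.2f}")  # ~0.5 until fine-tuned
```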
arXiv Detail & Related papers (2025-03-17T21:41:37Z)
- Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights [9.414198519543564]
We present codellm-devkit (hereafter, CLDK), an open-source library that significantly simplifies the process of performing program analysis.
CLDK offers developers an intuitive and user-friendly interface, making it incredibly easy to provide rich program analysis context to code LLMs.
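As a stand-in for what such a library supplies (this is not CLDK's API; it is a minimal sketch using Python's built-in ast module), the idea is to distill program-analysis facts into text that rides along with the prompt:

```python
# Minimal stand-in for program-analysis context, not CLDK's API.
import ast, textwrap

source = textwrap.dedent("""
    def fetch(url):
        return download(url)

    def download(url):
        ...
""")

tree = ast.parse(source)
functions = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
calls = [n.func.id for n in ast.walk(tree)
         if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]

# The analysis facts become plain text that a code LLM can condition on.
context = f"Functions defined: {functions}; calls made: {calls}"
print(context + "\n\nExplain what fetch() does.")
```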
arXiv Detail & Related papers (2024-10-16T20:05:59Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
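A concrete instance of what version-aware migration must handle: the same intent written against two pandas releases, since DataFrame.append was deprecated and then removed in pandas 2.0:

```python
# Version-aware code migration in miniature: pandas.DataFrame.append
# was removed in pandas 2.0 in favor of pd.concat.
import pandas as pd

a = pd.DataFrame({"x": [1]})
b = pd.DataFrame({"x": [2]})

# pandas < 2.0 (deprecated in 1.4, removed in 2.0):
# merged = a.append(b, ignore_index=True)

# pandas >= 2.0:
merged = pd.concat([a, b], ignore_index=True)
print(merged)
```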
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
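A toy sketch of the general mechanism, not CodeIP's exact scheme: the grammar constraint is reduced here to a precomputed set of syntactically valid tokens, and each payload bit steers which half of that set the sampler prefers:

```python
# Toy multi-bit watermark: each payload bit biases token choice within a
# grammar-valid candidate set. Illustrative only, not CodeIP's algorithm.
import hashlib

def preferred_half(candidates, prev_token, bit):
    """Deterministically split candidates keyed on the previous token;
    the payload bit selects which half to favor."""
    keyed = sorted(candidates, key=lambda t: hashlib.sha256(
        (prev_token + t).encode()).hexdigest())
    half = len(keyed) // 2
    return keyed[:half] if bit == 0 else keyed[half:]

grammar_valid = ["foo", "bar", "baz", "qux"]  # stand-in for parser output
payload = [1, 0, 1]
prev, out = "<s>", []
for bit in payload:
    choices = preferred_half(grammar_valid, prev, bit)
    prev = choices[0]  # a real system samples from LM logits boosted here
    out.append(prev)
print(out)  # a detector redoes the split to read the bits back
```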
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc, LILO's auto-documentation procedure, boosts performance by helping the synthesizer interpret and deploy learned abstractions.
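In miniature, the compress-then-document loop looks like the sketch below; the pattern matching is hard-coded for illustration, whereas Stitch's refactoring is far more general:

```python
# Toy compress-then-document loop: spot a repeated program fragment,
# lift it into a named abstraction, then hand it to an LLM to document.
from collections import Counter

programs = [
    "scale(rotate(shape, 90), 2)",
    "scale(rotate(blob, 90), 2)",
    "translate(dot, 1, 1)",
]

# Count candidate sub-patterns with the varying argument abstracted out.
patterns = Counter()
for p in programs:
    if "rotate(" in p and "scale(" in p:
        patterns["scale(rotate(<x>, 90), 2)"] += 1

abstraction, uses = patterns.most_common(1)[0]
print(f"new primitive f0(x) := {abstraction} (used {uses}x)")
# An AutoDoc-style step would now prompt an LLM:
#   "Give a readable name and docstring for f0(x) = scale(rotate(x, 90), 2)"
```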
arXiv Detail & Related papers (2023-10-30T17:55:02Z)
- AskIt: Unified Programming Interface for Programming with Large Language Models [0.0]
Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks.
This paper introduces AskIt, a domain-specific language specifically designed for LLMs.
Across 50 tasks, AskIt generated concise prompts, achieving a 16.14% reduction in prompt length compared to benchmarks.
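The decorator below is hypothetical, not AskIt's real interface; it only illustrates the DSL idea of turning a typed, documented function stub into an LLM prompt:

```python
# Hypothetical DSL sketch: a typed stub becomes a structured LLM prompt.
import inspect

def llm_task(fn):
    sig = inspect.signature(fn)
    def wrapper(*args):
        bound = sig.bind(*args)
        prompt = (f"Task: {fn.__doc__}\n"
                  f"Inputs: {dict(bound.arguments)}\n"
                  f"Return a value of type {sig.return_annotation}.")
        return prompt  # a real implementation would send this to an LLM
    return wrapper

@llm_task
def summarize(text: str) -> str:
    """Summarize the text in one sentence."""

print(summarize("LLMs show emergent abilities across many tasks."))
```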
arXiv Detail & Related papers (2023-08-29T21:44:27Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLM models and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- StarCoder: may the source be with you! [79.93915935620798]
The BigCode community introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length.
StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories.
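Loading the released checkpoint follows the standard Hugging Face pattern; note that bigcode/starcoder is gated, so this assumes the license has been accepted, an access token is configured, and accelerate is installed for device_map:

```python
# Standard transformers usage; the checkpoint is gated on the Hub.
from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt = "bigcode/starcoder"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")

prompt = "def fibonacci(n):"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=48)
print(tok.decode(out[0]))
```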
arXiv Detail & Related papers (2023-05-09T08:16:42Z)
- Sequence Model Design for Code Completion in the Modern IDE [3.4824234779710452]
We propose a novel design for predicting the top-k next tokens that combines the ability of static analysis to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them.
Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency.
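In miniature, the combination amounts to masking the model's distribution to the statically valid set and renormalizing; the logits and valid-token set below are stand-ins:

```python
# Static analysis supplies the legal tokens; the LM ranks only those.
import math

lm_logits = {"count": 2.1, "counter": 1.7, "print": 0.3, "while": -1.0}
valid = {"counter", "print", "while"}  # in-scope identifiers + keywords

masked = {t: s for t, s in lm_logits.items() if t in valid}
z = sum(math.exp(s) for s in masked.values())
top_k = sorted(((math.exp(s) / z, t) for t, s in masked.items()), reverse=True)
for p, t in top_k:
    print(f"{t}: {p:.2f}")  # e.g. "count" is excluded as out of scope
```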
arXiv Detail & Related papers (2020-04-10T22:40:49Z)