LiSum: Open Source Software License Summarization with Multi-Task
Learning
- URL: http://arxiv.org/abs/2309.05074v2
- Date: Fri, 22 Sep 2023 06:27:57 GMT
- Title: LiSum: Open Source Software License Summarization with Multi-Task
Learning
- Authors: Linyu Li, Sihan Xu, Yang Liu, Ya Gao, Xiangrui Cai, Jiarun Wu, Wenli
Song and Zheli Liu
- Abstract summary: Open source software (OSS) licenses regulate the conditions under which users can reuse, modify, and distribute the software legally.
There exist various OSS licenses in the community, written in a formal language, which are typically long and complicated to understand.
Motivated by the user study and the fast growth of licenses in the community, we propose the first study towards automated license summarization.
- Score: 16.521420821183995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open source software (OSS) licenses regulate the conditions under which users
can reuse, modify, and distribute the software legally. However, there exist
various OSS licenses in the community, written in a formal language, which are
typically long and complicated to understand. In this paper, we conducted a
661-participant online survey to investigate the perspectives and practices of
developers towards OSS licenses. The user study revealed a clear need for an
automated tool to facilitate license understanding. Motivated by the user study
and the fast growth of licenses in the community, we propose the first study
towards automated license summarization. Specifically, we released the first
high-quality text summarization dataset and designed two tasks, i.e., license
text summarization (LTS), aiming at generating a relatively short summary for
an arbitrary license, and license term classification (LTC), focusing on the
attitude inference towards a predefined set of key license terms (e.g.,
Distribute). Aiming at the two tasks, we present LiSum, a multi-task learning
method to help developers overcome the obstacles of understanding OSS licenses.
Comprehensive experiments demonstrated that the proposed joint training
objective boosted the performance on both tasks, surpassing state-of-the-art
baselines with gains of at least 5 points w.r.t. F1 scores of four
summarization metrics and achieving 95.13% micro average F1 score for
classification simultaneously. We released all the datasets, the replication
package, and the questionnaires for the community.
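The micro-averaged F1 score reported for the classification task pools true positives, false positives, and false negatives over all labels before computing precision and recall once. The following is a minimal sketch of that metric only, not the authors' evaluation code; the attitude labels (`can`, `cannot`, `must`) and the sample predictions are hypothetical and chosen purely for illustration.

```python
def micro_f1(gold, pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over all labels, then compute F1 once."""
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1          # predicted this label, and it was correct
            elif p == label and g != label:
                fp += 1          # predicted this label, but gold differs
            elif p != label and g == label:
                fn += 1          # gold is this label, but it was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical attitude labels for four license terms (illustration only).
labels = ["can", "cannot", "must"]
gold = ["can", "cannot", "must", "can"]
pred = ["can", "can", "must", "can"]

score = micro_f1(gold, pred, labels)
print(f"micro-F1 = {score:.2f}")
```

In this single-label setting every misclassification counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with plain accuracy.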
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards [49.7719149179179]
This paper investigates the feasibility of using PPO for reinforcement learning (RL) from explicitly programmed reward signals.
We focus on tasks expressed through formal languages, such as programming, where explicit reward functions can be programmed to automatically assess quality of generated outputs.
Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task.
arXiv Detail & Related papers (2024-10-22T15:59:58Z)
- OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
We employ an exhaustive approach, scanning all files containing "license" in their filepath, and apply the winnowing algorithm for robust text matching.
Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map.
arXiv Detail & Related papers (2024-09-07T13:34:55Z)
- LiCoEval: Evaluating LLMs on License Compliance in Code Generation [27.368667936460508]
Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers.
LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production.
This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code.
arXiv Detail & Related papers (2024-08-05T14:09:30Z)
- Towards Learning Abductive Reasoning using VSA Distributed Representations [56.31867341825068]
We introduce the Abductive Rule Learner with Context-awareness (ARLC) model.
ARLC features a novel and more broadly applicable training objective for abductive reasoning.
We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge.
arXiv Detail & Related papers (2024-06-27T12:05:55Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs).
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- LiResolver: License Incompatibility Resolution for Open Source Software [13.28021004336228]
LiResolver is a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software.
Comprehensive experiments demonstrate the effectiveness of LiResolver, with 4.09% false positive (FP) rate and 0.02% false negative (FN) rate for incompatibility issue localization.
arXiv Detail & Related papers (2023-06-26T13:16:09Z)
- Fact-Checking Complex Claims with Program-Guided Reasoning [99.7212240712869]
Program-Guided Fact-Checking (ProgramFC) is a novel fact-checking model that decomposes complex claims into simpler sub-tasks.
We first leverage the in-context learning ability of large language models to generate reasoning programs.
We execute the program by delegating each sub-task to the corresponding sub-task handler.
arXiv Detail & Related papers (2023-05-22T06:11:15Z)
- Defining the role of open source software in research reproducibility [0.0]
I make a new proposal for the role of open source software.
I look for explanation of its success from the perspectives of connectivism.
I contend that engenders trust, which we routinely build in community via conversations.
arXiv Detail & Related papers (2022-04-26T19:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.