LiSum: Open Source Software License Summarization with Multi-Task
Learning
- URL: http://arxiv.org/abs/2309.05074v2
- Date: Fri, 22 Sep 2023 06:27:57 GMT
- Title: LiSum: Open Source Software License Summarization with Multi-Task
Learning
- Authors: Linyu Li, Sihan Xu, Yang Liu, Ya Gao, Xiangrui Cai, Jiarun Wu, Wenli
Song and Zheli Liu
- Abstract summary: Open source software (OSS) licenses regulate the conditions under which users can reuse, modify, and distribute the software legally.
There exist various OSS licenses in the community, written in a formal language, which are typically long and complicated to understand.
Motivated by the user study and the fast growth of licenses in the community, we propose the first study towards automated license summarization.
- Score: 16.521420821183995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open source software (OSS) licenses regulate the conditions under which users
can reuse, modify, and distribute the software legally. However, there exist
various OSS licenses in the community, written in a formal language, which are
typically long and complicated to understand. In this paper, we conducted a
661-participants online survey to investigate the perspectives and practices of
developers towards OSS licenses. The user study revealed a genuine need for an
automated tool to facilitate license understanding. Motivated by the user study
and the fast growth of licenses in the community, we propose the first study
towards automated license summarization. Specifically, we released the first
high-quality text summarization dataset and designed two tasks, i.e., license
text summarization (LTS), aiming at generating a relatively short summary for
an arbitrary license, and license term classification (LTC), focusing on the
attitude inference towards a predefined set of key license terms (e.g.,
Distribute). Aiming at the two tasks, we present LiSum, a multi-task learning
method to help developers overcome the obstacles of understanding OSS licenses.
Comprehensive experiments demonstrated that the proposed jointly training
objective boosted the performance on both tasks, surpassing state-of-the-art
baselines with gains of at least 5 points w.r.t. F1 scores of four
summarization metrics and achieving 95.13% micro average F1 score for
classification simultaneously. We released all the datasets, the replication
package, and the questionnaires for the community.
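The 95.13% micro-average F1 reported above pools true positives, false positives, and false negatives across all license-term classes before computing a single precision/recall pair. A minimal sketch of how that metric is computed (illustrative only, with hypothetical per-class counts; this is not the paper's evaluation code):

```python
def micro_f1(counts):
    """Micro-averaged F1: sum (tp, fp, fn) over all classes first,
    then compute one precision/recall pair from the pooled counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    # Closed form: 2*tp / (2*tp + fp + fn) equals 2PR / (P + R)
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical (tp, fp, fn) counts for two license terms:
print(round(micro_f1([(8, 1, 1), (5, 2, 0)]), 4))  # 0.8667
```

Unlike macro averaging, this weights every prediction equally, so frequent license terms dominate the score.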
Related papers
- Towards Learning Abductive Reasoning using VSA Distributed Representations [56.31867341825068]
We introduce the Abductive Rule Learner with Context-awareness (ARLC) model.
ARLC features a novel and more broadly applicable training objective for abductive reasoning.
We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge.
arXiv Detail & Related papers (2024-06-27T12:05:55Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
We introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions.
We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE).
Comprehensive experiments are conducted to benchmark the performance of LLMs, revealing the challenging nature of these tasks and VersiCode.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX
Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z) - Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs).
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z) - Detecting and Fixing Violations of Modification Terms in Open Source
Licenses during Forking [4.682961105225832]
We first empirically characterize modification terms in 47 open source licenses.
Inspired by our study, we then design LiVo to automatically detect and fix violations of modification terms in open source licenses during forking.
arXiv Detail & Related papers (2023-10-12T02:37:06Z) - Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z) - The Software Heritage License Dataset (2022 Edition) [0.0]
The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided.
The dataset can be used to conduct empirical studies on open source licensing, to train automated license classifiers, and to perform natural language processing (NLP) analyses of legal texts.
arXiv Detail & Related papers (2023-08-22T08:01:07Z) - LiResolver: License Incompatibility Resolution for Open Source Software [13.28021004336228]
LiResolver is a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software.
Comprehensive experiments demonstrate the effectiveness of LiResolver, with 4.09% false positive (FP) rate and 0.02% false negative (FN) rate for incompatibility issue localization.
arXiv Detail & Related papers (2023-06-26T13:16:09Z) - Fact-Checking Complex Claims with Program-Guided Reasoning [99.7212240712869]
Program-Guided Fact-Checking (ProgramFC) is a novel fact-checking model that decomposes complex claims into simpler sub-tasks.
We first leverage the in-context learning ability of large language models to generate reasoning programs.
We execute the program by delegating each sub-task to the corresponding sub-task handler.
arXiv Detail & Related papers (2023-05-22T06:11:15Z) - LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of
Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z) - Defining the role of open source software in research reproducibility [0.0]
I make a new proposal for the role of open source software.
I look for an explanation of its success from the perspective of connectivism.
I contend that open source engenders trust, which we routinely build in the community via conversations.
arXiv Detail & Related papers (2022-04-26T19:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.