LiSum: Open Source Software License Summarization with Multi-Task
Learning
- URL: http://arxiv.org/abs/2309.05074v2
- Date: Fri, 22 Sep 2023 06:27:57 GMT
- Title: LiSum: Open Source Software License Summarization with Multi-Task
Learning
- Authors: Linyu Li, Sihan Xu, Yang Liu, Ya Gao, Xiangrui Cai, Jiarun Wu, Wenli
Song and Zheli Liu
- Abstract summary: Open source software (OSS) licenses regulate the conditions under which users can reuse, modify, and distribute the software legally.
There exist various OSS licenses in the community, written in a formal language, which are typically long and complicated to understand.
Motivated by the user study and the fast growth of licenses in the community, we propose the first study towards automated license summarization.
- Score: 16.521420821183995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open source software (OSS) licenses regulate the conditions under which users
can reuse, modify, and distribute the software legally. However, there exist
various OSS licenses in the community, written in a formal language, which are
typically long and complicated to understand. In this paper, we conducted a
661-participants online survey to investigate the perspectives and practices of
developers towards OSS licenses. The user study revealed a genuine need for an
automated tool to facilitate license understanding. Motivated by the user study
and the fast growth of licenses in the community, we propose the first study
towards automated license summarization. Specifically, we released the first
high-quality text summarization dataset and designed two tasks, i.e., license
text summarization (LTS), aiming at generating a relatively short summary for
an arbitrary license, and license term classification (LTC), focusing on the
attitude inference towards a predefined set of key license terms (e.g.,
Distribute). For these two tasks, we present LiSum, a multi-task learning
method to help developers overcome the obstacles of understanding OSS licenses.
Comprehensive experiments demonstrated that the proposed joint training
objective boosted the performance on both tasks, surpassing state-of-the-art
baselines with gains of at least 5 points w.r.t. F1 scores of four
summarization metrics and achieving 95.13% micro average F1 score for
classification simultaneously. We released all the datasets, the replication
package, and the questionnaires for the community.
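As a rough picture of the joint training objective described above, the following is a minimal sketch only: one shared encoder feeds both a summarization decoder (LTS) and a license-term classification head (LTC), and the two losses are summed. This is not the authors' released implementation; the module sizes, the number of license terms, the three attitude classes, and the weight alpha are all illustrative assumptions.

```python
import torch.nn as nn


class JointLicenseModel(nn.Module):
    """Shared encoder feeding a summarization decoder (LTS) and a term classifier (LTC)."""

    def __init__(self, vocab_size=30000, hidden=256, num_terms=23, num_attitudes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.lm_head = nn.Linear(hidden, vocab_size)                   # LTS: next-token logits
        self.cls_head = nn.Linear(hidden, num_terms * num_attitudes)   # LTC: attitude logits per term
        self.num_terms, self.num_attitudes = num_terms, num_attitudes

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed(src_ids))                   # state: (1, B, H)
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        sum_logits = self.lm_head(dec_out)                             # (B, T_tgt, vocab)
        cls_logits = self.cls_head(state[-1])                          # (B, terms * attitudes)
        return sum_logits, cls_logits.view(-1, self.num_terms, self.num_attitudes)


def joint_loss(sum_logits, sum_labels, cls_logits, cls_labels, alpha=1.0):
    """LTS token-level cross-entropy plus LTC per-term cross-entropy, weighted by alpha."""
    lts = nn.functional.cross_entropy(sum_logits.flatten(0, 1), sum_labels.flatten())
    ltc = nn.functional.cross_entropy(cls_logits.flatten(0, 1), cls_labels.flatten())
    return lts + alpha * ltc
```

The paper's actual architecture and loss weighting may differ; the point of the sketch is only that a single shared representation is trained against both task objectives at once.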
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards [49.7719149179179]
This paper investigates the feasibility of using PPO for reinforcement learning (RL) from explicitly programmed reward signals.
We focus on tasks expressed through formal languages, such as programming, where explicit reward functions can be programmed to automatically assess the quality of generated outputs (a toy example of such a reward follows this entry).
Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task.
arXiv Detail & Related papers (2024-10-22T15:59:58Z)
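A minimal, hypothetical sketch of such a programmed reward for an arithmetic task (not code from the paper; the prompt format is assumed):

```python
# Hypothetical programmed reward: the reward is computed directly from the prompt,
# so no learned reward model is needed.
def arithmetic_reward(prompt: str, completion: str) -> float:
    """prompt is assumed to look like '12+7='; completion is the model's answer."""
    try:
        expected = eval(prompt.rstrip("="))   # acceptable here: prompts are fixed-format and trusted
        return 1.0 if completion.strip() == str(expected) else 0.0
    except Exception:
        return 0.0


assert arithmetic_reward("12+7=", "19") == 1.0
assert arithmetic_reward("12+7=", " 20") == 0.0
```

Such a function would supply the scalar reward for each sampled completion inside the PPO loop.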
- OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
We employ an exhaustive approach, scanning all files containing "license" in their filepath, and apply the winnowing algorithm for robust text matching (a generic sketch of winnowing follows this entry).
Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map.
arXiv Detail & Related papers (2024-09-07T13:34:55Z)
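For reference, here is a generic, self-contained sketch of winnowing fingerprinting (Schleimer et al.'s algorithm in its simplest form); the k-gram length, window size, and hash choice are illustrative and not necessarily those used by the paper:

```python
import hashlib


def winnow(text: str, k: int = 5, w: int = 4):
    """Return a set of (position, hash) fingerprints for `text`."""
    # Normalize so whitespace and case changes do not affect matching.
    s = "".join(text.lower().split())
    grams = [s[i:i + k] for i in range(len(s) - k + 1)]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) & 0xFFFFFFFF for g in grams]
    fingerprints = set()
    for i in range(max(0, len(hashes) - w + 1)):
        window = hashes[i:i + w]
        j = min(range(w), key=window.__getitem__)   # leftmost minimum; winnowing proper prefers the rightmost
        fingerprints.add((i + j, window[j]))
    return fingerprints


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two fingerprint hash sets."""
    fa = {h for _, h in winnow(a)}
    fb = {h for _, h in winnow(b)}
    return len(fa & fb) / max(1, len(fa | fb))
```

Two license texts are then treated as a match when their fingerprint sets overlap heavily, which tolerates reformatting and small edits better than exact string comparison.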
- A First Look at License Compliance Capability of LLMs in Code Generation [27.368667936460508]
Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers.
LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production.
This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code.
arXiv Detail & Related papers (2024-08-05T14:09:30Z)
- Towards Learning Abductive Reasoning using VSA Distributed Representations [56.31867341825068]
We introduce the Abductive Rule Learner with Context-awareness (ARLC) model.
ARLC features a novel and more broadly applicable training objective for abductive reasoning.
We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge.
arXiv Detail & Related papers (2024-06-27T12:05:55Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs).
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- LiResolver: License Incompatibility Resolution for Open Source Software [13.28021004336228]
LiResolver is a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software.
Comprehensive experiments demonstrate the effectiveness of LiResolver, with a 4.09% false positive (FP) rate and a 0.02% false negative (FN) rate for incompatibility issue localization.
arXiv Detail & Related papers (2023-06-26T13:16:09Z)
- Fact-Checking Complex Claims with Program-Guided Reasoning [99.7212240712869]
Program-Guided Fact-Checking (ProgramFC) is a novel fact-checking model that decomposes complex claims into simpler sub-tasks.
We first leverage the in-context learning ability of large language models to generate reasoning programs.
We execute the program by delegating each sub-task to the corresponding sub-task handler.
arXiv Detail & Related papers (2023-05-22T06:11:15Z)
- Defining the role of open source software in research reproducibility [0.0]
I make a new proposal for the role of open source software.
I look for an explanation of its success from the perspective of connectivism.
I contend that it engenders trust, which we routinely build in community via conversations.
arXiv Detail & Related papers (2022-04-26T19:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.