LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance
- URL: http://arxiv.org/abs/2501.00106v1
- Date: Mon, 30 Dec 2024 19:04:13 GMT
- Title: LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance
- Authors: Jingwen Tan, Gopi Krishnan Rajbahadur, Zi Li, Xiangfu Song, Jianshan Lin, Dan Li, Zibin Zheng, Ahmed E. Hassan
- Abstract summary: We introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis.
We evaluate existing legal FMs and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%.
We demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy.
- Score: 27.595354325922436
- Abstract: Dataset license compliance is a critical yet complex aspect of developing commercial AI products, particularly with the increasing use of publicly available datasets. Ambiguities in dataset licenses pose significant legal risks, making it challenging even for software IP lawyers to accurately interpret rights and obligations. In this paper, we introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis. We first evaluate existing legal FMs (i.e., FMs specialized in understanding and processing legal texts) and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%. LicenseGPT, fine-tuned on a curated dataset of 500 licenses annotated by legal experts, significantly improves PA to 64.30%, outperforming both legal and general-purpose FMs. Through an A/B test and user study with software IP lawyers, we demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy. Software IP lawyers perceive LicenseGPT as a valuable supplementary tool that enhances efficiency while acknowledging the need for human oversight in complex cases. Our work underscores the potential of specialized AI tools in legal practice and offers a publicly available resource for practitioners and researchers.
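The reported time saving can be checked directly from the two timings quoted in the abstract. The snippet below is only an arithmetic sketch; the helper name `percent_reduction` is an assumption introduced here for illustration and does not come from the paper.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, expressed in percent."""
    return (before - after) / before * 100


# Timings quoted in the abstract: seconds per license without and with LicenseGPT.
seconds_without_tool = 108
seconds_with_tool = 6

reduction = percent_reduction(seconds_without_tool, seconds_with_tool)
speedup = seconds_without_tool / seconds_with_tool

print(f"{reduction:.2f}% reduction, {speedup:.0f}x faster")
# prints "94.44% reduction, 18x faster"
```

Put differently, a 94.44% reduction corresponds to analysing a license roughly 18 times faster.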
Related papers
- OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
This study presents a reusable and comprehensive dataset of open source software (OSS) licenses.
We found and identified 5.5 million distinct license blobs in OSS projects.
The dataset is open, providing a valuable resource for developers, researchers, and legal professionals in the OSS community.
arXiv Detail & Related papers (2024-09-07T13:34:55Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model [44.71845500433037]
We introduce LawGPT, the first open-source model specifically designed for Chinese legal applications.
LawGPT comprises two key components: legal-oriented pre-training and legal supervised fine-tuning.
Our experimental results demonstrate that LawGPT outperforms the open-source LLaMA 7B model.
arXiv Detail & Related papers (2024-06-07T03:52:56Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- LiSum: Open Source Software License Summarization with Multi-Task Learning [16.521420821183995]
Open source software (OSS) licenses regulate the conditions under which users can reuse, modify, and distribute the software legally.
There exist various OSS licenses in the community, written in a formal language, which are typically long and complicated to understand.
Motivated by the user study and the fast growth of licenses in the community, we propose the first study towards automated license summarization.
arXiv Detail & Related papers (2023-09-10T16:43:51Z)
- SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text.
Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
- Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model [30.30848216845138]
Chatlaw is an innovative legal assistant utilizing a Mixture-of-Experts (MoE) model and a multi-agent system.
By integrating knowledge graphs with artificial screening, we construct a high-quality legal dataset to train the MoE model.
Our MoE model outperforms GPT-4 in the Lawbench and Unified Exam Qualification for Legal Professionals by 7.73% in accuracy and 11 points, respectively.
arXiv Detail & Related papers (2023-06-28T10:48:34Z)
- FedSOV: Federated Model Secure Ownership Verification with Unforgeable Signature [60.99054146321459]
Federated learning allows multiple parties to collaborate in learning a global model without revealing private data.
We propose a cryptographic signature-based federated learning model ownership verification scheme named FedSOV.
arXiv Detail & Related papers (2023-05-10T12:10:02Z)
- Can I use this publicly available dataset to build commercial AI software? Most likely not [8.853674186565934]
We propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software.
Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes.
arXiv Detail & Related papers (2021-11-03T17:44:06Z)
- Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release Lawformer, a Longformer-based pre-trained language model for understanding long Chinese legal documents.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)