LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance
- URL: http://arxiv.org/abs/2501.00106v1
- Date: Mon, 30 Dec 2024 19:04:13 GMT
- Title: LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance
- Authors: Jingwen Tan, Gopi Krishnan Rajbahadur, Zi Li, Xiangfu Song, Jianshan Lin, Dan Li, Zibin Zheng, Ahmed E. Hassan
- Abstract summary: We introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis. We evaluate existing legal FMs and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%. We demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy.
- Score: 27.595354325922436
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Dataset license compliance is a critical yet complex aspect of developing commercial AI products, particularly with the increasing use of publicly available datasets. Ambiguities in dataset licenses pose significant legal risks, making it challenging even for software IP lawyers to accurately interpret rights and obligations. In this paper, we introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis. We first evaluate existing legal FMs (i.e., FMs specialized in understanding and processing legal texts) and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%. LicenseGPT, fine-tuned on a curated dataset of 500 licenses annotated by legal experts, significantly improves PA to 64.30%, outperforming both legal and general-purpose FMs. Through an A/B test and user study with software IP lawyers, we demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy. Software IP lawyers perceive LicenseGPT as a valuable supplementary tool that enhances efficiency while acknowledging the need for human oversight in complex cases. Our work underscores the potential of specialized AI tools in legal practice and offers a publicly available resource for practitioners and researchers.
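The abstract's two headline numbers, the Prediction Agreement (PA) score and the 94.44% time reduction, can be sanity-checked with a minimal sketch. The function names and label format below are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch: how a Prediction Agreement (PA) score and the reported
# time savings could be computed. Names are illustrative, not the paper's API.

def prediction_agreement(model_labels, expert_labels):
    """Fraction of licenses where the model's compliance verdict
    matches the legal expert's annotation."""
    assert len(model_labels) == len(expert_labels)
    matches = sum(m == e for m, e in zip(model_labels, expert_labels))
    return matches / len(model_labels)

def time_reduction(baseline_seconds, tool_seconds):
    """Relative reduction in per-license analysis time."""
    return (baseline_seconds - tool_seconds) / baseline_seconds

# The abstract's figures: 108 s -> 6 s per license
print(round(time_reduction(108, 6) * 100, 2))  # 94.44
```

The 94.44% figure follows directly from (108 - 6) / 108; PA as defined here is simple exact-match agreement, which is consistent with the percentages reported.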
Related papers
- Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing [45.6582862121583]
This paper argues that a dataset's legal risk cannot be accurately assessed from its license terms alone, and that tracing dataset redistribution across its full lifecycle is essential.
We show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts.
arXiv Detail & Related papers (2025-03-04T16:57:53Z)
- OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
This study presents a reusable and comprehensive dataset of open source software (OSS) licenses. We found and identified 5.5 million distinct license blobs in OSS projects. The dataset is open, providing a valuable resource for developers, researchers, and legal professionals in the OSS community.
arXiv Detail & Related papers (2024-09-07T13:34:55Z)
- InternLM-Law: An Open Source Chinese Legal Large Language Model [72.2589401309848]
InternLM-Law is a specialized LLM tailored for addressing diverse legal queries related to Chinese laws.
We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries.
InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks.
arXiv Detail & Related papers (2024-06-21T06:19:03Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
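The summary above contrasts In-Run Data Shapley with the re-training-based baseline it avoids. A minimal sketch of that baseline, Monte Carlo permutation sampling of marginal contributions, might look like the following, where `utility` is a stand-in for "train on this subset, return validation score" (the toy additive utility is an assumption for illustration):

```python
import random

# Hedged sketch of classic (re-training-based) Data Shapley via Monte Carlo
# permutation sampling -- the computationally intensive approach the
# In-Run method is designed to replace.

def shapley_values(points, utility, n_permutations=200, seed=0):
    rng = random.Random(seed)
    values = {p: 0.0 for p in points}
    for _ in range(n_permutations):
        perm = points[:]
        rng.shuffle(perm)
        prev_u, subset = utility(frozenset()), []
        for p in perm:
            subset.append(p)
            u = utility(frozenset(subset))
            values[p] += u - prev_u  # marginal contribution of p
            prev_u = u
    return {p: v / n_permutations for p, v in values.items()}

# Toy additive utility: each point contributes its own weight,
# so the exact Shapley value of a point equals its weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
util = lambda s: sum(weights[p] for p in s)
print(shapley_values(list(weights), util))
```

Each call to `utility` stands in for a full model re-training, which is exactly why this baseline scales poorly and motivates the in-run alternative.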
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model [44.71845500433037]
We introduce LawGPT, the first open-source model specifically designed for Chinese legal applications.
LawGPT comprises two key components: legal-oriented pre-training and legal supervised fine-tuning.
Our experimental results demonstrate that LawGPT outperforms the open-source LLaMA 7B model.
arXiv Detail & Related papers (2024-06-07T03:52:56Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
Legal and machine learning experts systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
The audit finds frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model [30.30848216845138]
Chatlaw is an innovative legal assistant utilizing a Mixture-of-Experts (MoE) model and a multi-agent system.
By integrating knowledge graphs with artificial screening, we construct a high-quality legal dataset to train the MoE model.
Our MoE model outperforms GPT-4 on LawBench and the Unified Qualification Exam for Legal Professionals by 7.73% in accuracy and 11 points, respectively.
arXiv Detail & Related papers (2023-06-28T10:48:34Z)
- FedSOV: Federated Model Secure Ownership Verification with Unforgeable Signature [60.99054146321459]
Federated learning allows multiple parties to collaborate in learning a global model without revealing private data.
We propose a cryptographic signature-based federated learning model ownership verification scheme named FedSOV.
arXiv Detail & Related papers (2023-05-10T12:10:02Z)
- Can I use this publicly available dataset to build commercial AI software? Most likely not [8.853674186565934]
We propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software.
Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes.
arXiv Detail & Related papers (2021-11-03T17:44:06Z)
- Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents [56.40163943394202]
We release Lawformer, a Longformer-based pre-trained language model for understanding long Chinese legal documents.
We evaluate Lawformer on a variety of LegalAI tasks, including judgment prediction, similar case retrieval, legal reading comprehension, and legal question answering.
arXiv Detail & Related papers (2021-05-09T09:39:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.