Related papers: Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs

URL: http://arxiv.org/abs/2502.19413v2
Date: Fri, 18 Apr 2025 15:48:01 GMT
Title: Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
Authors: Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge
Abstract summary: Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. We propose a new idea for the community to adopt: convert scholarly documents into knowledge preserving, but style agnostic representations.
Score: 26.952396644343537
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We propose a new idea for the community to adopt: convert scholarly documents into knowledge preserving, but style agnostic representations we term Knowledge Units using LLMs. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.

Related papers

We Should Separate Memorization from Copyright [29.232307526669967]
We argue that memorization should not be equated with copying and should not be used as a proxy for copyright infringement. We advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards.
arXiv Detail & Related papers (2026-02-09T13:24:06Z)
Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content? [47.50752173848172]
Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content in the context?
arXiv Detail & Related papers (2025-12-26T05:09:55Z)
Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center [49.85176045690678]
Generative artificial intelligence (AI) deployment in academic medical settings raises copyright compliance concerns. Dana-Farber Cancer Institute implemented GPT4DFCI, an internal generative AI tool utilizing OpenAI models. Four teams attempted to extract copyrighted content from GPT4DFCI across four domains.
arXiv Detail & Related papers (2025-06-26T23:11:49Z)
Extracting memorized pieces of (copyrighted) books from open-weight language models [64.69834802660128]
Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. We show that it's possible to extract substantial parts of at least some books from different LLMs. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
arXiv Detail & Related papers (2025-05-18T21:06:32Z)
Measuring Copyright Risks of Large Language Model via Partial Information Probing [14.067687792633372]
We explore the data sources used to train Large Language Models (LLMs). We input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
arXiv Detail & Related papers (2024-09-20T18:16:05Z)
A Multi-Source Heterogeneous Knowledge Injected Prompt Learning Method for Legal Charge Prediction [3.52209555388364]
We propose a prompt learning framework-based method for modeling case descriptions. We leverage multi-source external knowledge from a legal knowledge base, a conversational LLM, and legal articles. Our method achieves state-of-the-art results on CAIL-2018, the largest legal charge prediction dataset.
arXiv Detail & Related papers (2024-08-05T04:53:17Z)
Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts. We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs). We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
AKEW: Assessing Knowledge Editing in the Wild [79.96813982502952]
AKEW (Assessing Knowledge Editing in the Wild) is a new practical benchmark for knowledge editing. It fully covers three editing settings of knowledge updates: structured facts, unstructured texts as facts, and extracted triplets. Through extensive experiments, we demonstrate the considerable gap between state-of-the-art knowledge-editing methods and practical scenarios.
arXiv Detail & Related papers (2024-02-29T07:08:34Z)
A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works. Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement. We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia [57.31074448586854]
Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context. Yet the mechanisms underlying this contextual grounding remain unknown. We present a novel method to study grounding abilities using Fakepedia.
arXiv Detail & Related papers (2023-12-04T17:35:42Z)
Copyright Violations and Large Language Models [10.251605253237491]
This work explores the issue of copyright violations and large language models through the lens of verbatim memorization. We present experiments with a range of language models over a collection of popular books and coding problems. Overall, this research highlights the need for further examination and the potential impact on future developments in natural language processing to ensure adherence to copyright regulations.
arXiv Detail & Related papers (2023-10-20T19:14:59Z)
Source Attribution for Large Language Model-Generated Data [57.85840382230037]
It is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text. We show that this problem can be tackled by watermarking. We propose a source attribution framework that satisfies these key properties due to our algorithmic designs.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs [54.22416829200613]
Eva-KELLM is a new benchmark for evaluating knowledge editing of large language models. Experimental results indicate that the current methods for knowledge editing using raw documents are not effective in yielding satisfactory results.
arXiv Detail & Related papers (2023-08-19T09:17:19Z)
Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics [1.933681537640272]
This position paper probes the copyright interests of open data sets used to train large language models (LLMs). Our paper asks, how do LLMs trained on open data sets circumvent the copyright interests of the used data?
arXiv Detail & Related papers (2023-04-06T03:09:26Z)
The Semantic Scholar Open Data Platform [92.2948743167744]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works [1.6058099298620423]
Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature. Due to copyright restrictions, the availability of relevant digitized literary works is limited. Our attempts to invert BERT suggest that publishing parts of the encoder together with the contextualized embeddings is critical.
arXiv Detail & Related papers (2021-09-21T11:35:41Z)
CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers. We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.