Related papers: Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?

URL: http://arxiv.org/abs/2512.21871v1
Date: Fri, 26 Dec 2025 05:09:55 GMT
Title: Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?
Authors: Naen Xu, Jinghuai Zhang, Changjiang Li, Hengyu An, Chunyi Zhou, Jun Wang, Boyu Xu, Yuyuan Li, Tianyu Du, Shouling Ji
Abstract summary: Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content in the context?
Score: 47.50752173848172
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (i.e., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book excerpts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content -- such as book excerpts, news articles, music lyrics, and code documentation -- when it is presented as visual input. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, even when presented with the copyright notice. To address this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.

Related papers

Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks [123.36265437655187]
Copyright Detective is an interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive probing, and unlearning verification.
arXiv Detail & Related papers (2026-02-05T03:09:52Z)
Certified Mitigation of Worst-Case LLM Copyright Infringement [46.571805194176825]
"Copyright takedown" methods are aimed at preventing models from generating content substantially similar to copyrighted ones. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
arXiv Detail & Related papers (2025-04-22T17:16:53Z)
Do LLMs Know to Respect Copyright Notice? [11.14140288980773]
We investigate whether language models infringe upon copyrights when processing user input containing protected material. Our study offers a conservative evaluation of the extent to which language models may infringe upon copyrights. This research emphasizes the need for further investigation and the importance of ensuring LLMs respect copyright regulations.
arXiv Detail & Related papers (2024-11-02T04:45:21Z)
Measuring Copyright Risks of Large Language Model via Partial Information Probing [14.067687792633372]
We explore the data sources used to train Large Language Models (LLMs). We input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
arXiv Detail & Related papers (2024-09-20T18:16:05Z)
Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts. We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs). We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation [24.644101178288476]
Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns. LLMs may infringe on copyrights or overly restrict non-copyrighted texts. We propose lightweight, real-time defense to prevent the generation of copyrighted text.
arXiv Detail & Related papers (2024-06-18T18:00:03Z)
LLMs and Memorization: On Quality and Specificity of Copyright Compliance [0.0]
Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act.
arXiv Detail & Related papers (2024-05-28T18:01:52Z)
©Plug-in Authorization for Human Content Copyright Protection in Text-to-Image Model [71.47762442337948]
State-of-the-art models create high-quality content without crediting original creators. We propose the copyright Plug-in Authorization framework, introducing three operations: addition, extraction, and combination. Experiments in artist-style replication and cartoon IP recreation demonstrate copyright plug-ins' effectiveness.
arXiv Detail & Related papers (2024-04-18T07:48:00Z)
Copyright Protection in Generative AI: A Technical Perspective [58.84343394349887]
Generative AI has witnessed rapid advancement in recent years, expanding their capabilities to create synthesized content such as text, images, audio, and code. The high fidelity and authenticity of contents generated by these Deep Generative Models (DGMs) have sparked significant copyright concerns. This work delves into this issue by providing a comprehensive overview of copyright protection from a technical perspective.
arXiv Detail & Related papers (2024-02-04T04:00:33Z)
A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works. Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement. We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
Copyright Violations and Large Language Models [10.251605253237491]
This work explores the issue of copyright violations and large language models through the lens of verbatim memorization. We present experiments with a range of language models over a collection of popular books and coding problems. Overall, this research highlights the need for further examination and the potential impact on future developments in natural language processing to ensure adherence to copyright regulations.
arXiv Detail & Related papers (2023-10-20T19:14:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.