Related papers: Beyond English: Unveiling Multilingual Bias in LLM Copyright Compliance

URL: http://arxiv.org/abs/2503.05713v1
Date: Fri, 14 Feb 2025 16:59:10 GMT
Title: Beyond English: Unveiling Multilingual Bias in LLM Copyright Compliance
Authors: Yupeng Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie
Abstract summary: Large Language Models (LLMs) have raised significant concerns regarding the fair use of copyright-protected content. Do LLMs exhibit bias in protecting copyrighted works across languages? Is it easier to elicit copyrighted content using prompts in specific languages?
Score: 17.21382682644513
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have raised significant concerns regarding the fair use of copyright-protected content. While prior studies have examined the extent to which LLMs reproduce copyrighted materials, they have predominantly focused on English, neglecting multilingual dimensions of copyright protection. In this work, we investigate multilingual biases in LLM copyright protection by addressing two key questions: (1) Do LLMs exhibit bias in protecting copyrighted works across languages? (2) Is it easier to elicit copyrighted content using prompts in specific languages? To explore these questions, we construct a dataset of popular song lyrics in English, French, Chinese, and Korean and systematically probe seven LLMs using prompts in these languages. Our findings reveal significant imbalances in LLMs' handling of copyrighted content, both in terms of the language of the copyrighted material and the language of the prompt. These results highlight the need for further research and development of more robust, language-agnostic copyright protection mechanisms to ensure fair and consistent protection across languages.

Related papers

Extracting memorized pieces of (copyrighted) books from open-weight language models [64.69834802660128]
Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. We show that it's possible to extract substantial parts of at least some books from different LLMs. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
arXiv Detail & Related papers (2025-05-18T21:06:32Z)
Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages [51.96666324242191]
We analyze whether user utilization of novel writing assistants in a charity advertisement writing task is affected by the AI's performance in a second language. We quantify the extent to which these patterns translate into the persuasiveness of generated charity advertisements.
arXiv Detail & Related papers (2025-02-13T17:49:30Z)
Do LLMs Know to Respect Copyright Notice? [11.14140288980773]
We investigate whether language models infringe upon copyrights when processing user input containing protected material. Our study offers a conservative evaluation of the extent to which language models may infringe upon copyrights. This research emphasizes the need for further investigation and the importance of ensuring LLMs respect copyright regulations.
arXiv Detail & Related papers (2024-11-02T04:45:21Z)
How Do Multilingual Language Models Remember Facts? [50.13632788453612]
We show that previously identified recall mechanisms in English largely apply to multilingual contexts. We localize the role of language during recall, finding that subject enrichment is language-independent. In decoder-only LLMs, FVs compose these two pieces of information in two separate stages.
arXiv Detail & Related papers (2024-10-18T11:39:34Z)
Measuring Copyright Risks of Large Language Model via Partial Information Probing [14.067687792633372]
We explore the data sources used to train Large Language Models (LLMs). We input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
arXiv Detail & Related papers (2024-09-20T18:16:05Z)
Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts. We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs). We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation [24.644101178288476]
Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns. LLMs may infringe on copyrights or overly restrict non-copyrighted texts. We propose lightweight, real-time defense to prevent the generation of copyrighted text.
arXiv Detail & Related papers (2024-06-18T18:00:03Z)
LLMs and Memorization: On Quality and Specificity of Copyright Compliance [0.0]
Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act.
arXiv Detail & Related papers (2024-05-28T18:01:52Z)
Copyright Violations and Large Language Models [10.251605253237491]
This work explores the issue of copyright violations and large language models through the lens of verbatim memorization. We present experiments with a range of language models over a collection of popular books and coding problems. Overall, this research highlights the need for further examination and the potential impact on future developments in natural language processing to ensure adherence to copyright regulations.
arXiv Detail & Related papers (2023-10-20T19:14:59Z)
Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. Recent works have proposed algorithms to detect LLM-generated text and protect LLMs. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). This paper systematically investigates the advantages and challenges of LLMs for MMT. We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.