Fugu-MT Paper Translation (Abstract): To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Paper overview: To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

arxiv url: http://arxiv.org/abs/2605.14291v1
Date: Thu, 14 May 2026 02:49:27 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.589543
Title: To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
Authors: Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu
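Abstract summary: We propose MMGuard, which enables data owners to proactively protect their multimodal data against unauthorized fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. The results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models.
Reference score (independently computed attention metric): 21.217016062987234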
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherent post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.
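The abstract's core idea, a perturbation that minimizes rather than maximizes the training loss, can be illustrated with a toy sketch. This is not MMGuard's implementation: the logistic-regression surrogate, the L-infinity budget `epsilon`, the sign-gradient update, and every function name below are assumptions made for illustration. The model is held fixed here, and the cross-modal binding disruption and the ensemble strategy are not modeled.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(w, x, y):
    """Mean binary cross-entropy of a linear classifier w on inputs x."""
    p = np.clip(sigmoid(x @ w), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def error_minimizing_noise(w, x, y, epsilon=0.1, step=0.02, n_steps=20):
    """Projected sign-gradient descent on the inputs (the weights stay fixed).

    Hypothetical helper: each step moves the noise in the direction that
    lowers the training loss, then clips it back into [-epsilon, epsilon].
    """
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        p = sigmoid((x + delta) @ w)
        # Gradient of the per-sample BCE with respect to the input is (p - y) * w.
        grad = (p - y)[:, None] * w[None, :]
        delta = delta - step * np.sign(grad)       # descend: minimize the loss
        delta = np.clip(delta, -epsilon, epsilon)  # keep the noise small
    return delta


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))            # toy "data owner" samples
w_true = rng.normal(size=8)
y = (x @ w_true > 0).astype(float)      # toy labels
w = rng.normal(size=8)                  # surrogate model, a stand-in for an LVLM

epsilon = 0.1
delta = error_minimizing_noise(w, x, y, epsilon=epsilon)
clean_loss = bce_loss(w, x, y)
protected_loss = bce_loss(w, x + delta, y)
print(f"clean loss: {clean_loss:.4f}, protected loss: {protected_loss:.4f}")
```

Because the noise alone already lowers the loss, a model fine-tuned on `x + delta` has less incentive to learn the real features, which is the "optimization shortcut" the abstract describes; at inference time the noise is absent and the shortcut no longer helps.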
