Fugu-MT Paper Translation (Abstract): Quotient Semivalues for False-Name-Resistant Data Attribution

Paper abstract: Quotient Semivalues for False-Name-Resistant Data Attribution

arxiv url: http://arxiv.org/abs/2605.07663v1
Date: Fri, 08 May 2026 12:34:20 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.042497
Title: Quotient Semivalues for False-Name-Resistant Data Attribution
Title (reference translation): Quotient Semivalues for False-Name-Resistant Data Attribution
Authors: Florian A. D. Burnat, Brittany I. Davidson
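Abstract summary: The paper formalizes false-name manipulation in ML data attribution. It computes Shapley-, Banzhaf-, and Beta-style values over evidence-backed attribution clusters. The mechanisms are instantiated in DataMarket-Gym, a benchmark for attribution under strategic provider attacks.
Reference score (independently computed attention level): 1.253312107729806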
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness--Sybil frontier.
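Abstract (reference translation): Data valuation methods allocate payments and audit the contribution of training data to machine-learning pipelines, but they often assume passive contributors. In practice, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: it computes Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters rather than raw identities, and uses a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility result: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and we characterize the split-gain of a general semivalue on a unanimity counterexample. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold only approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness-Sybil frontier.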
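
The split-gain on a unanimity game and the effect of valuing clusters instead of identities can be illustrated with a small sketch. This is a textbook-style toy example and not the authors' implementation: the provider names, the data items `x1`, `x2`, `y`, the helper functions, and the perfectly known `cluster_of` mapping are all assumptions made for illustration. The value function pays 1 only when every required item is present, so it is monotone and binary-valued.

```python
from fractions import Fraction
from itertools import permutations

def shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    players = list(players)
    totals = {p: Fraction(0) for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical unanimity-style data-value game: worth 1 only with all items.
REQUIRED = frozenset({"x1", "x2", "y"})

def data_value(items):
    return 1 if REQUIRED <= items else 0

def coalition_game(holdings):
    """Turn a mapping player -> data items into a coalition value function."""
    def value(coalition):
        items = frozenset().union(*(holdings[p] for p in coalition))
        return data_value(items)
    return value

def quotient_shapley(holdings, cluster_of):
    """Merge identities into their (assumed known) clusters, then value clusters."""
    clusters = {}
    for identity, items in holdings.items():
        clusters.setdefault(cluster_of[identity], set()).update(items)
    merged = {c: frozenset(items) for c, items in clusters.items()}
    return shapley(merged, coalition_game(merged))

# Honest report: provider A holds x1 and x2, provider B holds y.
honest = {"A": {"x1", "x2"}, "B": {"y"}}
# False-name attack: A splits its data across two pseudonyms.
attack = {"A1": {"x1"}, "A2": {"x2"}, "B": {"y"}}
# Assumed perfect provenance evidence linking both pseudonyms to one cluster.
cluster_of = {"A1": "A", "A2": "A", "B": "B"}

honest_phi = shapley(honest, coalition_game(honest))
attack_phi = shapley(attack, coalition_game(attack))
quotient_phi = quotient_shapley(attack, cluster_of)

print("honest share of A:", honest_phi["A"])                          # 1/2
print("split share of A:", attack_phi["A1"] + attack_phi["A2"])       # 2/3
print("quotient share of A's cluster:", quotient_phi["A"])            # 1/2
```

Under identity-level Shapley the split raises A's total from 1/2 to 2/3, while valuing the merged cluster restores 1/2. The sketch assumes the clustering is exact; the abstract's bounds concern the case where some identities escape their cluster or are merged incorrectly.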

Related papers