Fugu-MT Paper Translation (Abstract): When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval

Paper overview: When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval

arxiv url: http://arxiv.org/abs/2606.01304v2
Date: Sun, 07 Jun 2026 15:44:02 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:04.763014
Title: When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval
Authors: Zhicheng Zhang, Jiwei Tang, Kuicai Dong, Xiaopeng Li, Jieming Zhu, Jingyu Li, Qianhui Zhu, Fengyuan Lu, Wang Jiaheng, Gang Wang, Hai-Tao Zheng, Zhaocheng Du
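Abstract summary: We show that naively incorporating generated negatives into contrastive learning often degrades retrieval performance. Our analysis reveals two compounding failure modes: discriminative-agnostic generation and source-dependent shortcuts. To close this gap, we propose CausalNeg, which consists of two main modules.
Reference score (independently computed attention metric): 45.5843471557695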
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hard negative mining has become the dominant strategy for training retrievers, yet it faces intrinsic limitations: negatives are bounded by corpus availability, selected by retriever score rather than diagnostic value, and increasingly contaminated by false positives as the retriever improves. LLM-based synthesis offers a principled alternative, with negatives that are unconstrained, targeted, and free from false positive risk. However, we show that naively incorporating generated negatives into contrastive learning often degrades retrieval performance. We identify and formalize the root cause as a generative-discriminative gap: LLM generation optimizes for fluent, plausible text, while contrastive learning demands strategic violations of relevance at the decision boundary. Our analysis reveals two compounding failure modes: discriminative-agnostic generation, where the LLM lacks an explicit model of query information needs and defaults to generic or topic-drifted text that provides no contrastive signal; and source-dependent shortcuts, where distributional artifacts enable the model to distinguish negatives by origin rather than relevance, causing gradient drift that actively corrupts optimization. To close this gap, we propose CausalNeg, which consists of two main modules: (1) CoT-guided counterfactual perturbation for data construction, which decomposes why a document satisfies a query into explicit information requirements, then surgically violates individual requirements to construct negatives with controlled, interpretable hardness; and (2) query-view entropy maximization during training, which disperses generated negatives across the similarity spectrum, minimizing the mutual information between source identity and similarity scores to suppress shortcut exploitation. We make our code publicly available at https://github.com/mzhangzhicheng/CausalNeg.
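Illustrative sketch (not from the paper): the abstract describes source-dependent shortcuts as cases where negatives can be told apart by origin rather than relevance, and names the mutual information between source identity and similarity scores as the quantity to minimize. The Python below is only a diagnostic under stated assumptions, not the CausalNeg training objective. It estimates that mutual information empirically by binning similarity scores. The function names, the equal-width binning over [-1, 1], and the bin count are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def bin_index(score, num_bins=4, low=-1.0, high=1.0):
    """Map a similarity score in [low, high] to one of num_bins equal-width bins.

    Assumption: scores are cosine similarities, hence the default range.
    """
    clipped = min(max(score, low), high)
    idx = int((clipped - low) / (high - low) * num_bins)
    return min(idx, num_bins - 1)  # keep score == high inside the last bin


def mutual_information(pairs):
    """Empirical mutual information (in nats) of two discrete variables.

    pairs is a list of (x, y) observations.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


def source_similarity_mi(mined_sims, generated_sims, num_bins=4):
    """Mutual information between negative source and binned similarity score."""
    pairs = [("mined", bin_index(s, num_bins)) for s in mined_sims]
    pairs += [("generated", bin_index(s, num_bins)) for s in generated_sims]
    return mutual_information(pairs)


# Generated negatives cluster in their own similarity band: origin is predictable.
separable = source_similarity_mi([0.1, 0.2, 0.1, 0.2], [0.9, 0.95, 0.9, 0.95])

# Generated negatives are spread over the same bands as mined ones.
dispersed = source_similarity_mi([0.1, 0.9, 0.1, 0.9], [0.2, 0.95, 0.2, 0.95])
```

In this toy setup, the first case yields a mutual information equal to the entropy of the source label (log 2 nats for two balanced sources), while the second yields zero, since the binned score carries no information about origin. A differentiable training loss would need a soft estimate in place of hard bins.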
