Fugu-MT Paper Translation (Abstract): OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

Paper abstract: OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

arxiv url: http://arxiv.org/abs/2603.17205v1
Date: Tue, 17 Mar 2026 23:11:45 GMT
Status: Translation complete
System update date: 2026-03-19 18:32:57.439566
Title: OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation
Title (reference translation): OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation
Authors: Haoyang Fang, Shuai Zhang, Yifei Ma, Hengyi Wang, Cuixiong Hu, Katrin Kirchhoff, Bernie Wang, George Karypis
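Abstract summary: Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation.
Reference score (independently computed attention score): 39.548179971747906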
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.
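Abstract (reference translation): Domain-specific finetuning is essential for dense retrievers, but not all training pairs contribute equally to learning. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and the efficiency of retrieval model adaptation. We first examine static pruning (SP), which keeps only high-similarity query-document pairs, and show that ranking (NDCG) improves while retrieval (Recall) can degrade because query diversity is reduced. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively adjusts sampling probabilities at both the query and document levels, prioritizing high-quality examples while keeping access to the full training set. SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%). These findings extend to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.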
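
The contrast between static and dynamic pruning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how similarities are computed, how the sampling probabilities are updated during training, or what the two stages look like in detail. The function names (`static_prune`, `two_stage_sample`), the softmax weighting, the `temperature` and `keep_ratio` parameters, the use of a query's best similarity as its score, and the toy data are all assumptions made for illustration. Scores are also held fixed here, whereas the abstract describes them as modulated throughout training.

```python
import math
import random
from collections import defaultdict


def softmax(scores, temperature):
    """Turn scores into probabilities; every item keeps a non-zero probability."""
    m = max(scores)
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def static_prune(pairs, keep_ratio):
    """Static pruning sketch: keep only the highest-similarity pairs, once, up front."""
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]


def two_stage_sample(pairs, batch_size, temperature, rng):
    """Dynamic pruning sketch: sample a query first, then one of its documents.

    Nothing is discarded; low-similarity pairs are simply drawn less often.
    """
    docs_by_query = defaultdict(list)
    for query, doc, sim in pairs:
        docs_by_query[query].append((doc, sim))
    queries = list(docs_by_query)
    # Assumption: a query's score is the best similarity among its documents.
    query_scores = [max(s for _, s in docs_by_query[q]) for q in queries]
    query_probs = softmax(query_scores, temperature)

    batch = []
    for _ in range(batch_size):
        q = rng.choices(queries, weights=query_probs, k=1)[0]
        docs = docs_by_query[q]
        doc_probs = softmax([s for _, s in docs], temperature)
        doc, sim = rng.choices(docs, weights=doc_probs, k=1)[0]
        batch.append((q, doc, sim))
    return batch


# Toy (query, document, similarity) triples; the values are made up.
pairs = [("q1", "d1", 0.9), ("q1", "d2", 0.4), ("q2", "d3", 0.7), ("q3", "d4", 0.1)]

kept = static_prune(pairs, keep_ratio=0.5)
kept_queries = {q for q, _, _ in kept}  # q3 is gone entirely after static pruning

batch = two_stage_sample(pairs, batch_size=16, temperature=0.5, rng=random.Random(0))
```

In this toy setup, static pruning removes `q3` from the training set altogether, which mirrors the reduced query diversity the abstract associates with lower Recall. The sampling variant keeps `q3` reachable and only lowers how often it is drawn.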


Related papers