Fugu-MT Paper Translation (Summary): Query Drift Compensation: Enabling Compatibility in Continual Learning of Retrieval Embedding Models

Paper summary: Query Drift Compensation: Enabling Compatibility in Continual Learning of Retrieval Embedding Models

arxiv url: http://arxiv.org/abs/2506.00037v1
Date: Tue, 27 May 2025 14:52:52 GMT
Status: Translation complete
Last updated in system: 2025-06-04 21:47:32.080104
Title: Query Drift Compensation: Enabling Compatibility in Continual Learning of Retrieval Embedding Models
Authors: Dipam Goswami, Liying Wang, Bartłomiej Twardowski, Joost van de Weijer
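Abstract summary: The paper studies how an already indexed corpus can still be used effectively without re-indexing. To maintain stability, embedding distillation is applied to both query and document embeddings. A novel query drift compensation method, applied during retrieval, projects the new model's query embeddings into the old embedding space.
Reference score (independently computed attention measure): 12.586519025284328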
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text embedding models enable semantic search, powering several NLP applications like Retrieval Augmented Generation through efficient information retrieval (IR). However, text embedding models are commonly studied in scenarios where the training data is static, which limits their applicability to dynamic scenarios where new training data emerges over time. IR methods generally encode a huge corpus of documents into low-dimensional embeddings and store them in a database index. During retrieval, a semantic search over the corpus is performed and the document whose embedding is most similar to the query embedding is returned. When updating an embedding model with new training data, using the already indexed corpus is suboptimal due to incompatibility, since the model that was used to obtain the corpus embeddings has changed. While re-indexing the old corpus documents with the updated model restores compatibility, it requires much more computation and time. Thus, it is critical to study how the already indexed corpus can still be used effectively without re-indexing. In this work, we establish a continual learning benchmark with large-scale datasets, continually train dense retrieval embedding models on query-document pairs from new datasets in each task, and observe forgetting on old tasks due to significant drift of the embeddings. We employ embedding distillation on both query and document embeddings to maintain stability and propose a novel query drift compensation method that, during retrieval, projects the new model's query embeddings to the old embedding space. This enables compatibility with previously indexed corpus embeddings extracted using the old model and thus reduces forgetting. We show that the proposed method significantly improves performance without any re-indexing. Code is available at https://github.com/dipamgoswami/QDC.
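Illustrative sketch: the abstract says that new-model query embeddings are projected to the old embedding space at retrieval time, but it does not say how the projection is computed. The toy example below assumes one simple possibility, a mean offset estimated from a few queries embedded by both the old and the updated model. That estimate, the function names (estimate_query_drift, compensate, retrieve) and the 2-D vectors are all hypothetical and are not taken from the paper or its repository.

```python
# Illustrative sketch only: all names here are hypothetical, and the
# mean-offset drift estimate is an assumption, not taken from the abstract.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def estimate_query_drift(old_queries, new_queries):
    """Average per-dimension offset (new - old) over paired query embeddings."""
    n = len(old_queries)
    dim = len(old_queries[0])
    return [
        sum(new[d] - old[d] for old, new in zip(old_queries, new_queries)) / n
        for d in range(dim)
    ]


def compensate(query, drift):
    """Shift a new-model query embedding back toward the old embedding space."""
    return [q - d for q, d in zip(query, drift)]


def retrieve(query, index):
    """Return the id of the indexed document most similar to the query."""
    return max(index, key=lambda doc_id: dot(query, index[doc_id]))


# Corpus embeddings produced by the OLD model; they are never re-indexed.
index = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}

# The same calibration queries embedded by the old and the updated model.
old_queries = [[0.8, 0.2], [0.2, 0.8]]
new_queries = [[0.2, 0.8], [-0.4, 1.4]]
drift = estimate_query_drift(old_queries, new_queries)

# A query the old model would embed as [0.9, 0.1] (near doc_a),
# here embedded by the updated model instead.
new_query = [0.3, 0.7]

uncompensated_hit = retrieve(new_query, index)
compensated_hit = retrieve(compensate(new_query, drift), index)
print(uncompensated_hit, compensated_hit)
```

In this toy setup the drifted query retrieves doc_b from the old index, while the compensated query retrieves doc_a, so the script prints "doc_b doc_a". Only the query side is adjusted; the stored document embeddings stay untouched, which is the property the abstract emphasizes.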
