Fugu-MT Paper Translation (Summary): Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval

Paper summary: Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval

arxiv url: http://arxiv.org/abs/2407.11504v1
Date: Tue, 16 Jul 2024 08:42:36 GMT
Status: Translation complete
Last updated in system: 2024-12-02 00:47:18.973946
Title: Bootstrapped Pre-training with Dynamic Identifier Prediction for Generative Retrieval
Authors: Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
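Abstract summary: Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query. Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning. The paper introduces BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the continuing memorization of the corpus.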
Reference score (independently computed attention measure): 108.9772640854136
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative retrieval uses differentiable search indexes to directly generate relevant document identifiers in response to a query. Recent studies have highlighted the potential of a strong generative retrieval model, trained with carefully crafted pre-training tasks, to enhance downstream retrieval tasks via fine-tuning. However, the full power of pre-training for generative retrieval remains underexploited due to its reliance on pre-defined static document identifiers, which may not align with evolving model parameters. In this work, we introduce BootRet, a bootstrapped pre-training method for generative retrieval that dynamically adjusts document identifiers during pre-training to accommodate the continuing memorization of the corpus. BootRet involves three key training phases: (i) initial identifier generation, (ii) pre-training via corpus indexing and relevance prediction tasks, and (iii) bootstrapping for identifier updates. To facilitate the pre-training phase, we further introduce noisy documents and pseudo-queries, generated by large language models, to resemble semantic connections in both indexing and retrieval tasks. Experimental results demonstrate that BootRet significantly outperforms existing pre-training generative retrieval baselines and performs well even in zero-shot settings.
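The three phases named in the abstract can be sketched as a simple loop. This is only a control-flow illustration under stated assumptions: `toy_make_ids`, `toy_train_round`, the dict-based `model`, the word-prefix identifiers, and the `rounds` parameter are all hypothetical stand-ins. The abstract does not specify how identifiers are generated or updated, or how many bootstrapping rounds are run.

```python
from typing import Callable, Dict, List, Tuple

Ids = Dict[str, str]  # document key -> identifier string


def bootstrapped_pretraining(
    docs: Dict[str, str],
    make_ids: Callable[[dict, Dict[str, str]], Ids],
    train_round: Callable[[dict, Dict[str, str], Ids], dict],
    model: dict,
    rounds: int = 3,
) -> Tuple[dict, Ids, List[Ids]]:
    """Alternate between training on the current identifiers and refreshing them."""
    # Phase (i): initial identifier generation with the starting model.
    ids = make_ids(model, docs)
    history = [dict(ids)]
    for _ in range(rounds):
        # Phase (ii): pre-training (corpus indexing and relevance prediction
        # are collapsed into one hypothetical training call here).
        model = train_round(model, docs, ids)
        # Phase (iii): bootstrapping, i.e. regenerate identifiers
        # using the updated model.
        ids = make_ids(model, docs)
        history.append(dict(ids))
    return model, ids, history


def toy_make_ids(model: dict, docs: Dict[str, str]) -> Ids:
    # Hypothetical stand-in: the identifier is the first k words of the
    # document, where k grows as the toy "model" trains.
    k = model["step"] + 1
    return {key: "-".join(text.lower().split()[:k]) for key, text in docs.items()}


def toy_train_round(model: dict, docs: Dict[str, str], ids: Ids) -> dict:
    # Hypothetical stand-in for a real training step: only bumps a counter.
    return {"step": model["step"] + 1}


docs = {
    "d1": "Generative retrieval with dynamic identifiers",
    "d2": "Bootstrapped pre-training for search",
}
model, ids, history = bootstrapped_pretraining(
    docs, toy_make_ids, toy_train_round, {"step": 0}, rounds=2
)
for round_index, snapshot in enumerate(history):
    print(round_index, snapshot)
```

The point of the sketch is only that identifiers are not fixed before training: each round's identifiers depend on the model state produced by the previous round, which is the contrast with pre-defined static identifiers drawn in the abstract.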
