Fugu-MT Paper Translation (Abstract): Investigating LLM Variability in Personalized Conversational Information Retrieval

Paper overview: Investigating LLM Variability in Personalized Conversational Information Retrieval

arxiv url: http://arxiv.org/abs/2510.03795v1
Date: Sat, 04 Oct 2025 12:13:19 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.256074
Title: Investigating LLM Variability in Personalized Conversational Information Retrieval
Title (reference translation): An Investigation of LLM Variability in Personalized Conversational Information Retrieval
Authors: Simon Lupart, Daniël van Dijk, Eric Langezaal, Ian van Dort, Mohammad Aliannejadi
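Abstract summary: Mo et al. explored several strategies for incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query reformulation. This study applies the original methods to the new TREC iKAT 2024 dataset and evaluates a diverse range of models, including Llama (1B-70B), Qwen-7B, and GPT-4o-mini. The results show that human-selected PTKBs consistently improve retrieval performance, while LLM-based selection methods do not reliably outperform manual selection.
Reference score (attention score computed independently by this site): 14.220276130333849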
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Personalized Conversational Information Retrieval (CIR) has seen rapid progress in recent years, driven by the development of Large Language Models (LLMs). Personalized CIR aims to enhance document retrieval by leveraging user-specific information, such as preferences, knowledge, or constraints, to tailor responses to individual needs. A key resource for this task is the TREC iKAT 2023 dataset, designed to evaluate personalization in CIR pipelines. Building on this resource, Mo et al. explored several strategies for incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query reformulation. Their findings suggested that personalization from PTKBs could be detrimental and that human annotations were often noisy. However, these conclusions were based on single-run experiments using the GPT-3.5 Turbo model, raising concerns about output variability and repeatability. In this reproducibility study, we rigorously reproduce and extend their work, focusing on LLM output variability and model generalization. We apply the original methods to the new TREC iKAT 2024 dataset and evaluate a diverse range of models, including Llama (1B-70B), Qwen-7B, GPT-4o-mini. Our results show that human-selected PTKBs consistently enhance retrieval performance, while LLM-based selection methods do not reliably outperform manual choices. We further compare variance across datasets and observe higher variability on iKAT than on CAsT, highlighting the challenges of evaluating personalized CIR. Notably, recall-oriented metrics exhibit lower variance than precision-oriented ones, a critical insight for first-stage retrievers. Finally, we underscore the need for multi-run evaluations and variance reporting when assessing LLM-based CIR systems. By broadening evaluation across models, datasets, and metrics, our study contributes to more robust and generalizable practices for personalized CIR.
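Abstract (reference translation): Personalized Conversational Information Retrieval (CIR) has progressed rapidly in recent years, driven by the development of Large Language Models (LLMs). Personalized CIR aims to enhance document retrieval by leveraging user-specific information such as preferences, knowledge, and constraints, tailoring responses to individual needs. A key resource for this task is the TREC iKAT 2023 dataset, designed to evaluate personalization in CIR pipelines. Building on this resource, Mo et al. explored several strategies for incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query reformulation. Their results suggested that personalization from PTKBs could be detrimental and that human annotations were often noisy. However, these conclusions were based on single-run experiments with the GPT-3.5 Turbo model, which raises concerns about output variability and repeatability. This reproducibility study rigorously reproduces and extends that work, focusing on LLM output variability and model generalization. The original methods are applied to the new TREC iKAT 2024 dataset, and a diverse range of models is evaluated, including Llama (1B-70B), Qwen-7B, and GPT-4o-mini. The results show that human-selected PTKBs consistently improve retrieval performance, while LLM-based selection methods do not reliably outperform manual selection. The study further compares variance across datasets and observes higher variability on iKAT than on CAsT, highlighting the challenges of evaluating personalized CIR. Notably, recall-oriented metrics show lower variance than precision-oriented ones. Finally, it underscores the need for multi-run evaluation and variance reporting when assessing LLM-based CIR systems. By broadening evaluation across models, datasets, and metrics, the study contributes to more robust and generalizable practices for personalized CIR.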

Related papers