Improving the Reusability of Conversational Search Test Collections
- URL: http://arxiv.org/abs/2503.09899v1
- Date: Wed, 12 Mar 2025 23:36:40 GMT
- Title: Improving the Reusability of Conversational Search Test Collections
- Authors: Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, Mohammad Aliannejadi,
- Abstract summary: Incomplete relevance judgments limit the reusability of test collections. This is due to pockets of unjudged documents (called holes) in the test collection that the new systems return. We employ Large Language Models (LLMs) to fill holes by leveraging existing judgments.
- Score: 9.208308067952155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Incomplete relevance judgments limit the reusability of test collections. When new systems are compared to previous systems that contributed to the pool, they often face a disadvantage. This is due to pockets of unjudged documents (called holes) in the test collection that the new systems return. The very nature of Conversational Search (CS) means that these holes are potentially larger and more problematic when evaluating systems. In this paper, we aim to extend CS test collections by employing Large Language Models (LLMs) to fill holes by leveraging existing judgments. We explore this problem using the TREC iKAT 23 and TREC CAsT 22 collections, where information needs are highly dynamic and responses are much more varied, leaving bigger holes to fill. Our experiments reveal that CS collections become less reusable in deeper turns. Also, fine-tuning the Llama 3.1 model leads to high agreement with human assessors, while few-shot prompting ChatGPT results in low agreement with humans. Consequently, filling the holes of a new system using ChatGPT leads to a larger shift in the new system's ranking. However, regenerating the assessment pool with few-shot-prompted ChatGPT and using it for evaluation achieves a high rank correlation with human-assessed pools. We show that filling the holes using few-shot training of the Llama 3.1 model enables a fairer comparison between the new system and the systems that contributed to the pool. Our hole-filling model based on few-shot training of the Llama 3.1 model can improve the reusability of test collections.
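To make the hole-filling setup concrete, below is a minimal sketch of judging unjudged (turn, passage) pairs with an LLM and merging the predictions into existing judgments. The prompt wording, the 0-3 grading scale, and the `call_llm` stand-in are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: fill relevance-judgment "holes" with LLM predictions.
# `call_llm` is a hypothetical stand-in for any chat-completion client.

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM API call; returns a relevance grade as text."""
    raise NotImplementedError("plug in your LLM client here")

PROMPT = (
    "Given the conversational information need and a passage, grade the "
    "passage's relevance from 0 (not relevant) to 3 (highly relevant).\n"
    "Need: {need}\nPassage: {passage}\nGrade:"
)

def fill_holes(qrels: dict, pool: dict, needs: dict, passages: dict) -> dict:
    """Judge only the unjudged (turn, passage) pairs; keep human labels."""
    filled = dict(qrels)                      # (turn_id, doc_id) -> grade
    for turn_id, doc_ids in pool.items():
        for doc_id in doc_ids:
            if (turn_id, doc_id) in filled:
                continue                      # human judgment exists; keep it
            grade = call_llm(PROMPT.format(need=needs[turn_id],
                                           passage=passages[doc_id]))
            filled[(turn_id, doc_id)] = int(grade.strip()[0])
    return filled
```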
Related papers
- Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets [15.549852480638066]
We propose a novel paradigm for re-ranking called online relevance estimation.
Online relevance estimation continuously updates relevance estimates for a query throughout the ranking process.
We validate our approach on TREC benchmarks under two scenarios: hybrid retrieval and adaptive retrieval.
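As one reading of "continuously updates relevance estimates", here is a hedged sketch of an adaptive scoring loop; the similarity-propagation rule and the 0.1 weight are assumptions for illustration, not the paper's algorithm.

```python
def online_rerank(candidates, cheap_score, exact_score, sim, budget):
    """Spend an exact-scoring budget adaptively: documents similar to
    already-found relevant documents get their estimates boosted."""
    remaining = {d: cheap_score(d) for d in candidates}
    scored = {}
    while remaining and len(scored) < budget:
        doc = max(remaining, key=remaining.get)    # best current estimate
        scored[doc] = exact_score(doc)
        del remaining[doc]
        if scored[doc] > 0:                        # treat as relevance evidence
            for d in remaining:                    # propagate to neighbours
                remaining[d] += 0.1 * sim(doc, d) * scored[doc]
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```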
arXiv Detail & Related papers (2025-04-12T22:05:50Z)
- Variations in Relevance Judgments and the Shelf Life of Test Collections [50.060833338921945]
The paradigm shift towards neural retrieval models has affected the characteristics of modern test collections. We reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings. We observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers.
arXiv Detail & Related papers (2025-02-28T10:46:56Z)
- Enhancing Retrieval Performance: An Ensemble Approach For Hard Negative Mining [0.0]
This study focuses on explaining the crucial role of hard negatives in the training process of cross-encoder models.
We have developed a robust hard negative mining technique for efficient training of cross-encoder re-rank models on an enterprise dataset.
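The mining recipe itself is not spelled out in the summary; the sketch below shows one common approach, pooling top-ranked non-positives from an ensemble of retrievers. The `rankers` and `per_query` names are illustrative assumptions.

```python
def mine_hard_negatives(query, positives, rankers, k=50, per_query=8):
    """Pool top-ranked documents from several retrievers and keep the
    highest-ranked ones that are NOT known positives ("hard" negatives)."""
    seen, negatives = set(positives), []
    for ranker in rankers:                 # e.g., BM25 plus a dense retriever
        for doc_id in ranker(query, k):    # ranked doc ids, best first
            if doc_id not in seen:
                negatives.append(doc_id)
                seen.add(doc_id)
            if len(negatives) >= per_query:
                return negatives
    return negatives
```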
arXiv Detail & Related papers (2024-10-18T05:23:39Z)
- Can We Use Large Language Models to Fill Relevance Judgment Holes? [9.208308067952155]
We take initial steps towards extending existing test collections by employing Large Language Models (LLMs) to fill the holes.
We find substantially lower correlations when human plus automatic judgments are used.
arXiv Detail & Related papers (2024-05-09T07:39:19Z)
- MixBCT: Towards Self-Adapting Backward-Compatible Training [66.52766344751635]
We propose MixBCT, a simple yet highly effective backward-compatible training method.
We conduct experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C.
arXiv Detail & Related papers (2023-08-14T05:55:38Z)
- Three Ways of Using Large Language Models to Evaluate Chat [3.7767218432589553]
This paper describes the systems submitted by team6 for ChatEval, the DSTC 11 Track 4 competition.
We present three different approaches to predicting turn-level qualities of responses based on large language models (LLMs).
We report improvement over the baseline using dynamic few-shot examples from a vector store for the prompts for ChatGPT.
An ablation study conducted after the challenge deadline shows that the new Llama 2 models are closing the performance gap between ChatGPT and open-source LLMs.
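A minimal sketch of the "dynamic few-shot examples from a vector store" idea: embed the incoming response, fetch the nearest annotated examples, and splice them into the prompt. The store layout and the `embed` function are assumptions, not the team's exact setup.

```python
import numpy as np

def build_prompt(turn, store, embed, k=4):
    """Retrieve the k nearest annotated examples from the vector store and
    prepend them as few-shot demonstrations for the quality prompt."""
    q = embed(turn)                               # embedding for this turn
    sims = [(float(np.dot(q, v)), ex) for v, ex in store]
    sims.sort(key=lambda p: p[0], reverse=True)   # nearest first (unit-norm vectors)
    demos = "\n".join(f"Response: {e['text']}\nScore: {e['score']}"
                      for _, e in sims[:k])
    return f"{demos}\nResponse: {turn}\nScore:"
```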
arXiv Detail & Related papers (2023-08-12T08:34:15Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
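The paper's exact calibration strategies are not reproduced here; the sketch below shows the simplest mitigation in this spirit, scoring both presentation orders and averaging. The `judge` callable is a placeholder assumption.

```python
def balanced_judgment(judge, question, resp_a, resp_b):
    """Query the LLM judge with both presentation orders and average the
    outcome, so neither response benefits from appearing first."""
    wins_a = 0.0
    for a_first in (True, False):
        first, second = (resp_a, resp_b) if a_first else (resp_b, resp_a)
        first_won = judge(question, first, second)   # 1 if first wins, else 0
        wins_a += first_won if a_first else 1 - first_won
    return wins_a / 2    # 1.0: A wins both orders; 0.5: order-dependent tie
```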
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- Towards Teachable Reasoning Systems [29.59387051046722]
We develop a teachable reasoning system for question-answering (QA).
Our approach is three-fold: First, generated chains of reasoning show how answers are implied by the system's own internal beliefs.
Second, users can interact with the explanations to identify erroneous model beliefs and provide corrections.
Third, we augment the model with a dynamic memory of such corrections.
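Below is a dependency-free sketch of such a dynamic memory of corrections, keyed here by naive word overlap rather than the system's actual retrieval mechanism (an assumption for brevity).

```python
class CorrectionMemory:
    """Dynamic memory of user corrections, looked up by word overlap.
    (A real system would likely use embeddings; this keeps the sketch simple.)"""

    def __init__(self):
        self.entries = []                        # (question, corrected_belief)

    def add(self, question: str, correction: str) -> None:
        self.entries.append((question, correction))

    def recall(self, question: str, k: int = 3):
        words = set(question.lower().split())
        ranked = sorted(self.entries,
                        key=lambda e: len(words & set(e[0].lower().split())),
                        reverse=True)
        return [c for _, c in ranked[:k]]        # feed these into the prompt
```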
arXiv Detail & Related papers (2022-04-27T17:15:07Z)
- Group-aware Contrastive Regression for Action Quality Assessment [85.43203180953076]
We show that the relations among videos can provide important clues for more accurate action quality assessment.
Our approach outperforms previous methods by a large margin and establishes new state-of-the-art on all three benchmarks.
arXiv Detail & Related papers (2021-08-17T17:59:39Z)
- Joint Passage Ranking for Diverse Multi-Answer Retrieval [56.43443577137929]
We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a question.
This task requires joint modeling of retrieved passages, as models should not repeatedly retrieve passages containing the same answer at the cost of missing a different valid answer.
In this paper, we introduce JPR, a joint passage retrieval model focusing on reranking. To model the joint probability of the retrieved passages, JPR makes use of an autoregressive reranker that selects a sequence of passages, equipped with novel training and decoding algorithms.
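JPR's autoregressive reranker is learned end to end; as intuition only, here is a greedy stand-in that skips passages whose predicted answers duplicate ones already covered. The `score` and `answers` callables are hypothetical.

```python
def select_diverse(passages, score, answers, k=5):
    """Greedy stand-in for joint passage selection: prefer high-scoring
    passages whose (predicted) answer set is not already covered."""
    pool = sorted(passages, key=score, reverse=True)
    chosen, covered = [], set()
    for p in pool:
        if len(chosen) == k:
            break
        if answers(p) & covered:
            continue               # would repeat an already-covered answer
        chosen.append(p)
        covered |= answers(p)
    return chosen if chosen else pool[:k]
```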
arXiv Detail & Related papers (2021-04-17T04:48:36Z)
- Revisiting Deep Local Descriptor for Improved Few-Shot Classification [56.74552164206737]
We show how one can improve the quality of embeddings by leveraging Dense Classification and Attentive Pooling.
We suggest pooling feature maps with attentive pooling instead of the widely used global average pooling (GAP) to prepare embeddings for few-shot classification.
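The contrast with GAP is easy to show in code; below is a minimal attentive-pooling module in PyTorch, illustrative rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Pool an HxW feature map with learned attention weights instead of
    a uniform average (GAP weights every location equally)."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # per-location score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W)
        w = self.attn(x).flatten(2).softmax(dim=-1)        # (B, 1, H*W)
        v = x.flatten(2)                                   # (B, C, H*W)
        return (v * w).sum(dim=-1)                         # (B, C) embedding
```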
arXiv Detail & Related papers (2021-03-30T00:48:28Z)
- Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning [69.42679922160684]
We propose feedback-weighted learning based on importance sampling to improve upon an initial supervised system using binary user feedback.
Our work opens the prospect to exploit interactions with real users and improve conversational systems after deployment.
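A rough sketch of the importance-sampling objective under binary feedback; the exact loss form and the weight clamp are assumptions for illustration, not the paper's formulation.

```python
import torch

def feedback_weighted_loss(logp_new, logp_old, feedback):
    """Importance-sampling-style objective from logged interactions:
    logp_new / logp_old are log-probs of each logged response under the
    current and deployed models; feedback is 1 (good) / 0 (bad)."""
    weights = torch.exp(logp_new - logp_old).detach().clamp(max=10.0)
    reward = 2.0 * feedback - 1.0          # map {0, 1} -> {-1, +1}
    return -(weights * reward * logp_new).mean()
```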
arXiv Detail & Related papers (2020-11-01T19:50:34Z)