Replicability Measures for Longitudinal Information Retrieval Evaluation
- URL: http://arxiv.org/abs/2409.05417v1
- Date: Mon, 09 Sep 2024 08:19:43 GMT
- Title: Replicability Measures for Longitudinal Information Retrieval Evaluation
- Authors: Jüri Keller, Timo Breuer, Philipp Schaer
- Abstract summary: This work explores how the effectiveness measured in evolving experiments can be assessed.
The persistency of effectiveness is investigated as a replicability task.
It was found that the most effective systems are not necessarily the ones with the most persistent performance.
- Score: 3.4917392789760147
- Abstract: Information Retrieval (IR) systems are exposed to constant changes in most components. Documents are created, updated, or deleted, the information needs are changing, and even relevance might not be static. While it is generally expected that the IR systems retain a consistent utility for the users, test collection evaluations rely on a fixed experimental setup. Based on the LongEval shared task and test collection, this work explores how the effectiveness measured in evolving experiments can be assessed. Specifically, the persistency of effectiveness is investigated as a replicability task. It is observed how the effectiveness progressively deteriorates over time compared to the initial measurement. Employing adapted replicability measures provides further insight into the persistence of effectiveness. The ranking of systems varies across retrieval measures and time. In conclusion, it was found that the most effective systems are not necessarily the ones with the most persistent performance.
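To make the distinction between "most effective" and "most persistent" concrete, the following is a minimal sketch and not the paper's adapted replicability measures. It assumes persistence can be approximated by the relative change in a system's score between an initial and a later snapshot, and that ranking stability can be summarized with Kendall's tau. The system names, scores, and the helper functions `relative_change` and `kendall_tau` are hypothetical.

```python
from itertools import combinations


def relative_change(initial: float, later: float) -> float:
    """Relative change in effectiveness versus the initial measurement."""
    if initial == 0:
        raise ValueError("initial score must be non-zero")
    return (later - initial) / initial


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall's tau-a between two system orderings (assumes no tied scores)."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both snapshots order the pair the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical mean effectiveness scores (e.g. nDCG) at two points in time.
t0 = {"sys_a": 0.40, "sys_b": 0.35, "sys_c": 0.30}
t1 = {"sys_a": 0.30, "sys_b": 0.33, "sys_c": 0.27}

changes = {name: relative_change(t0[name], t1[name]) for name in t0}
most_effective = max(t0, key=t0.get)             # best initial score
most_persistent = max(changes, key=changes.get)  # smallest relative drop
tau = kendall_tau(t0, t1)                        # ranking agreement across time

print(most_effective, most_persistent, round(tau, 3))
```

In this made-up data, the system with the highest initial score also loses the most over time, so the most effective and the most persistent system differ, which mirrors the kind of outcome the abstract describes.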
Related papers
- Impact of Usability Mechanisms: A Family of Experiments on Efficiency, Effectiveness and User Satisfaction [0.5419296578793327]
We use a family of three experiments to increase the precision and generalization of the results in the baseline experiment.
We find that the Abort Operation and Preferences usability mechanisms appear to improve system usability a great deal with respect to efficiency, effectiveness and user satisfaction.
arXiv Detail & Related papers (2024-08-22T21:23:18Z)
- Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability [20.797306325588153]
We highlight the gap between studying retrieval performance on static knowledge documents and understanding performance in real-world environments.
Our findings demonstrate the effectiveness of a listwise reranking approach, which proficiently handles inaccuracies induced by temporal distribution shifts.
Among listwise rerankers, our findings show that ListT5 effectively mitigates the positional bias problem by adopting the Fusion-in-Decoder architecture.
arXiv Detail & Related papers (2024-07-09T09:43:42Z)
- Evaluation of Temporal Change in IR Test Collections [3.4917392789760147]
This work investigates how the temporal generalizability of effectiveness evaluations can be assessed.
We show that the proposed measures can be well adapted to describe the changes in the retrieval results.
arXiv Detail & Related papers (2024-07-01T15:25:31Z)
- Unified Active Retrieval for Retrieval Augmented Generation [69.63003043712696]
In Retrieval-Augmented Generation (RAG), retrieval is not always helpful and applying it to every instruction is sub-optimal.
Existing active retrieval methods face two challenges:
1. They usually rely on a single criterion, which struggles with handling various types of instructions.
2. They depend on specialized and highly differentiated procedures, and thus combining them makes the RAG system more complicated.
arXiv Detail & Related papers (2024-06-18T12:09:02Z)
- Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues.
We propose a novel approach to address this issue at test time without requiring retraining.
MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z)
- Decoy Effect In Search Interaction: Understanding User Behavior and Measuring System Vulnerability [33.78769577114657]
The study explores how decoy results alter users' interactions on search engine result pages.
It introduces the DEJA-VU metric to assess systems' susceptibility to the decoy effect.
The results show differences in systems' effectiveness and vulnerability.
arXiv Detail & Related papers (2024-03-27T11:20:48Z)
- Early Period of Training Impacts Out-of-Distribution Generalization [56.283944756315066]
We investigate the relationship between learning dynamics and OOD generalization during the early period of neural network training.
We show that selecting the number of trainable parameters at different times during training has a minuscule impact on in-distribution (ID) results.
The absolute values of sharpness and trace of Fisher Information in the initial period of training are not indicative of OOD generalization.
arXiv Detail & Related papers (2024-03-22T13:52:53Z)
- REBAR: Retrieval-Based Reconstruction for Time-series Contrastive Learning [64.08293076551601]
We propose a novel method of using a learned measure for identifying positive pairs.
Our Retrieval-Based Reconstruction (REBAR) measure quantifies the similarity between two sequences.
We show that the REBAR error is a predictor of mutual class membership.
arXiv Detail & Related papers (2023-11-01T13:44:45Z)
- Towards Unbiased Visual Emotion Recognition via Causal Intervention [63.74095927462]
We propose a novel Interventional Emotion Recognition Network (IERN) to alleviate the negative effects of dataset bias.
A series of designed tests validate the effectiveness of IERN, and experiments on three emotion benchmarks demonstrate that IERN outperforms other state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-26T10:40:59Z)
- How Far Should We Look Back to Achieve Effective Real-Time Time-Series Anomaly Detection? [1.0437764544103274]
Anomaly detection is the process of identifying unexpected events or abnormalities in data.
RePAD (Real-time Proactive Anomaly Detection algorithm) is a generic approach with all above-mentioned features.
It is unclear how different amounts of historical data points affect the performance of RePAD.
arXiv Detail & Related papers (2021-02-12T14:51:05Z)
- ReMP: Rectified Metric Propagation for Few-Shot Learning [67.96021109377809]
A rectified metric space is learned to maintain the metric consistency from training to testing.
Numerous analyses indicate that a simple modification of the objective can yield substantial performance gains.
The proposed ReMP is effective and efficient, and outperforms the state of the art on various standard few-shot learning datasets.
arXiv Detail & Related papers (2020-12-02T00:07:53Z)