LongEval at CLEF 2025: Longitudinal Evaluation of IR Model Performance
- URL: http://arxiv.org/abs/2503.08541v1
- Date: Tue, 11 Mar 2025 15:29:41 GMT
- Title: LongEval at CLEF 2025: Longitudinal Evaluation of IR Model Performance
- Authors: Matteo Cancellieri, Alaa El-Ebshihy, Tobias Fink, Petra Galuščáková, Gabriela Gonzalez-Saez, Lorraine Goeuriot, David Iommi, Jüri Keller, Petr Knoth, Philippe Mulhem, Florina Piroi, David Pride, Philipp Schaer
- Abstract summary: LongEval Lab continues to explore the challenges of temporal persistence in Information Retrieval (IR). By evaluating how model performance degrades as test data diverge temporally from training data, LongEval seeks to advance the understanding of temporal dynamics in IR systems. The 2025 edition aims to engage the IR and NLP communities in addressing the development of adaptive models that can maintain retrieval quality over time in the domains of web search and scientific retrieval.
- Score: 5.4043491660907135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the third edition of the LongEval Lab, part of the CLEF 2025 conference, which continues to explore the challenges of temporal persistence in Information Retrieval (IR). The lab features two tasks designed to provide researchers with test data that reflect the evolving nature of user queries and document relevance over time. By evaluating how model performance degrades as test data diverge temporally from training data, LongEval seeks to advance the understanding of temporal dynamics in IR systems. The 2025 edition aims to engage the IR and NLP communities in addressing the development of adaptive models that can maintain retrieval quality over time in the domains of web search and scientific retrieval.
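The temporal-degradation evaluation the abstract describes can be sketched as follows. This is a minimal illustration, not the lab's official evaluation code: the per-query relevance lists and snapshot names below are hypothetical, and real LongEval runs use the lab's released collections and qrels. The sketch computes mean NDCG@10 on two test snapshots, one close to the training period and one temporally distant, so a drop between them reflects the kind of degradation LongEval measures.

```python
# Hedged sketch: measuring temporal degradation with mean NDCG@10.
# All relevance data below are hypothetical; real LongEval evaluation
# uses the lab's released test collections and official qrels.
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_rels, k=10):
    """NDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical per-query relevance lists (in ranked order) for two
# test snapshots: the model was trained near snapshot t0; t1 is later.
snapshot_t0 = [[2, 1, 0, 1], [1, 1, 0, 0]]
snapshot_t1 = [[1, 0, 0, 1], [0, 1, 0, 0]]

for name, runs in [("t0", snapshot_t0), ("t1", snapshot_t1)]:
    mean = sum(ndcg_at_k(r) for r in runs) / len(runs)
    print(f"{name}: mean NDCG@10 = {mean:.3f}")
```

Comparing the two means quantifies how far retrieval quality drifts as the test data diverge temporally from the training data, which is the core measurement underlying both LongEval tasks.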
Related papers
- Document Reconstruction Unlocks Scalable Long-Context RLVR [60.74632963522131]
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (i.e. long-context) of Large Language Models (LLMs). We investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBenchv2.
arXiv Detail & Related papers (2026-02-09T03:23:23Z)
- ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval [11.41528830724814]
We present ScholarGym, a simulation environment for reproducible evaluation of deep research on academic literature. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth.
arXiv Detail & Related papers (2026-01-29T12:51:44Z)
- Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent [52.876617746453995]
Dr.Mi-Bench is a Modular-integrated benchmark for scientific deep research (DR) agents. Dr.Mi-Eval is a novel modular-integrated evaluation paradigm.
arXiv Detail & Related papers (2025-11-30T17:16:47Z)
- LongEval at CLEF 2025: Longitudinal Evaluation of IR Systems on Web and Scientific Data [10.309769289748273]
The LongEval lab focuses on the evaluation of information retrieval systems over time. Two datasets are provided that capture evolving search scenarios with changing documents, queries, and relevance assessments. We present an overview of this year's tasks and datasets, as well as the participating systems.
arXiv Detail & Related papers (2025-09-22T08:05:40Z)
- Characterizing Deep Research: A Benchmark and Formal Definition [24.523394260858822]
We propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process.
arXiv Detail & Related papers (2025-08-06T08:09:28Z)
- DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval [44.99833362998488]
The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025. Our analysis of the Qwant web dataset includes exploratory data analysis with topic modeling over time. Our best system achieves an average NDCG@10 of 0.296 across the entire training and test dataset, with an overall best score of 0.395 on 2023-05.
arXiv Detail & Related papers (2025-07-11T07:23:08Z)
- Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science? [2.769064123193329]
We present a new information retrieval task to identify concept equivalence across question and response options.
This paper investigates multiple unsupervised approaches on a survey dataset spanning 1946-2020.
We show that IR-specialised neural models achieve the highest overall performance with other approaches performing comparably.
arXiv Detail & Related papers (2025-04-29T12:00:33Z)
- Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
Generative retrieval reformulates retrieval as an autoregressive generation task, where large language models generate target documents directly from a query. We systematically investigate training and inference scaling laws in generative retrieval, exploring how model size, training data scale, and inference-time compute jointly influence performance.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
- TSFeatLIME: An Online User Study in Enhancing Explainability in Univariate Time Series Forecasting [1.9314780151274307]
This paper presents TSFeatLIME, a framework extending TSLIME.
TSFeatLIME integrates an auxiliary feature into the surrogate model and considers the pairwise Euclidean distances between the queried time series and the generated samples.
Results show that the surrogate model under the TSFeatLIME framework is able to better simulate the behaviour of the black-box considering distance, without sacrificing accuracy.
arXiv Detail & Related papers (2024-09-24T10:24:53Z)
- Robust Neural Information Retrieval: An Adversarial and Out-of-distribution Perspective [111.58315434849047]
The robustness of neural information retrieval (IR) models has garnered significant attention.
We view the robustness of IR to be a multifaceted concept, emphasizing its necessity against adversarial attacks, out-of-distribution (OOD) scenarios and performance variance.
We provide an in-depth discussion of existing methods, datasets, and evaluation metrics, shedding light on challenges and future directions in the era of large language models.
arXiv Detail & Related papers (2024-07-09T16:07:01Z)
- Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z)
- On the Resurgence of Recurrent Models for Long Sequences -- Survey and Research Opportunities in the Transformer Era [59.279784235147254]
This survey is aimed at providing an overview of these trends framed under the unifying umbrella of Recurrence.
It emphasizes novel research opportunities that become prominent when abandoning the idea of processing long sequences.
arXiv Detail & Related papers (2024-02-12T23:55:55Z)
- Can LMs Generalize to Future Data? An Empirical Analysis on Text Summarization [50.20034493626049]
Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets.
Existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets.
We show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data.
arXiv Detail & Related papers (2023-05-03T08:08:07Z)
- Continual Learning of Long Topic Sequences in Neural Information Retrieval [2.3846478553599098]
We first propose a dataset based upon the MSMarco corpus aiming at modeling a long stream of topics.
We then analyze in depth the ability of recent neural IR models to continually learn from those streams.
arXiv Detail & Related papers (2022-01-10T14:19:09Z)
- Deep learning for temporal data representation in electronic health records: A systematic review of challenges and methodologies [11.584972135829199]
Temporal electronic health records can be a wealth of information for secondary uses, such as clinical events prediction or chronic disease management.
We sought articles that reported deep learning methodologies on temporal data representation in structured EHR data from January 1, 2010, to August 30, 2020.
Four major challenges were identified, including data irregularity, data heterogeneity, data sparsity, and model opacity.
arXiv Detail & Related papers (2021-07-21T09:00:40Z)
- Two-Stream Consensus Network: Submission to HACS Challenge 2021 Weakly-Supervised Learning Track [78.64815984927425]
The goal of weakly-supervised temporal action localization is to temporally locate and classify action of interest in untrimmed videos.
We adopt the two-stream consensus network (TSCN) as the main framework in this challenge.
Our solution ranked 2nd in this challenge, and we hope our method can serve as a baseline for future academic research.
arXiv Detail & Related papers (2021-06-21T03:36:36Z)
- Deep Learning-Based Human Pose Estimation: A Survey [66.01917727294163]
Human pose estimation has drawn increasing attention during the past decade.
It has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality.
Recent deep learning-based solutions have achieved high performance in human pose estimation.
arXiv Detail & Related papers (2020-12-24T18:49:06Z)