DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval
- URL: http://arxiv.org/abs/2507.08360v1
- Date: Fri, 11 Jul 2025 07:23:08 GMT
- Title: DS@GT at LongEval: Evaluating Temporal Performance in Web Search Systems and Topics with Two-Stage Retrieval
- Authors: Anthony Miyaguchi, Imran Afrulbasha, Aleksandar Pramov,
- Abstract summary: The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025. Our analysis of the Qwant web dataset includes exploratory data analysis with topic modeling over time. Our best system achieves an average NDCG@10 of 0.296 across the entire training and test dataset, with an overall best score of 0.395 on 2023-05.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information Retrieval (IR) models are often trained on static datasets, making them vulnerable to performance degradation as web content evolves. The DS@GT competition team participated in the Longitudinal Evaluation of Model Performance (LongEval) lab at CLEF 2025, which evaluates IR systems across temporally distributed web snapshots. Our analysis of the Qwant web dataset includes exploratory data analysis with topic modeling over time. The two-phase retrieval system employs sparse keyword searches, utilizing query expansion and document reranking. Our best system achieves an average NDCG@10 of 0.296 across the entire training and test dataset, with an overall best score of 0.395 on 2023-05. The accompanying source code for this paper is at https://github.com/dsgt-arc/longeval-2025
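For readers unfamiliar with the metric reported above, the following is a minimal sketch of how NDCG@10 can be computed for one query and then averaged over several snapshots. It is not the authors' evaluation code: the function names, the linear gain (some tools use 2^rel - 1 instead), and the snapshot labels and relevance values in the usage example are all illustrative assumptions.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels.

    Assumes linear gain: the label itself, discounted by log2(rank + 1)
    with ranks starting at 1.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels, judged_rels, k=10):
    """NDCG@k for one query.

    ranked_rels: relevance labels of the returned documents, in ranked order.
    judged_rels: labels of all judged documents for the query, used to
                 build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(judged_rels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_rels, k) / ideal_dcg


def mean_ndcg(per_query, k=10):
    """Average NDCG@k over (ranked_rels, judged_rels) pairs."""
    scores = [ndcg_at_k(ranked, judged, k) for ranked, judged in per_query]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical relevance labels for two snapshots; not data from the paper.
    snapshots = {
        "snapshot-a": [([2, 0, 1], [2, 1, 0]), ([0, 1], [1])],
        "snapshot-b": [([1, 0, 0], [1, 1])],
    }
    for name, queries in snapshots.items():
        print(name, round(mean_ndcg(queries), 3))
```

Averaging the per-snapshot means again gives a single figure comparable in kind to the "average NDCG@10 across the entire dataset" quoted in the abstract, although the paper's exact averaging procedure is not stated here.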
Related papers
- OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
- LongEval at CLEF 2025: Longitudinal Evaluation of IR Systems on Web and Scientific Data [10.309769289748273]
LongEval lab focuses on the evaluation of information retrieval systems over time. Two datasets are provided that capture evolving search scenarios with changing documents, queries, and relevance assessments. We present an overview of this year's tasks and datasets, as well as the participating systems.
arXiv Detail & Related papers (2025-09-22T08:05:40Z)
- Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval [0.0]
This project presents a framework for indexing and analyzing large language training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus, achieving fast query performance, with most searches in milliseconds and all under 2 seconds.
arXiv Detail & Related papers (2025-08-29T17:04:20Z)
- Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search [54.987957691350665]
Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. We propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search.
arXiv Detail & Related papers (2025-08-28T08:51:51Z)
- Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG). RAG requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
- LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs [12.412316728679167]
LeetCodeDataset is a high-quality benchmark for evaluating and training code-generation models. The dataset and evaluation framework are available on Hugging Face and GitHub.
arXiv Detail & Related papers (2025-04-20T15:28:16Z)
- LongEval at CLEF 2025: Longitudinal Evaluation of IR Model Performance [5.4043491660907135]
LongEval Lab continues to explore the challenges of temporal persistence in Information Retrieval (IR). By evaluating how model performance degrades as test data diverge temporally from training data, LongEval seeks to advance the understanding of temporal dynamics in IR systems. The 2025 edition aims to engage the IR and NLP communities in addressing the development of adaptive models that can maintain retrieval quality over time in the domains of web search and scientific retrieval.
arXiv Detail & Related papers (2025-03-11T15:29:41Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Human annotators judge our DACO-RL algorithm to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- Unified Long-Term Time-Series Forecasting Benchmark [0.6526824510982802]
We present a comprehensive dataset designed explicitly for long-term time-series forecasting.
We incorporate a collection of datasets obtained from diverse, dynamic systems and real-life records.
To determine the most effective model in diverse scenarios, we conduct an extensive benchmarking analysis using classical and state-of-the-art models.
Our findings reveal intriguing performance comparisons among these models, highlighting the dataset-dependent nature of model effectiveness.
arXiv Detail & Related papers (2023-09-27T18:59:00Z)
- Benchmarking Performance of Deep Learning Model for Material Segmentation on Two HPC Systems [0.0]
Performance data is gathered on two ERDC DSRC systems, Onyx and Vulcanite.
Vulcanite has faster model times in a large number of benchmarks, but it is also more subject to environmental factors that can make its performance slower than Onyx's.
arXiv Detail & Related papers (2023-07-27T15:03:13Z)
- SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines across the well-acknowledged benchmark, BEIR.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z)
- Temporal Graph Benchmark for Machine Learning on Temporal Graphs [54.52243310226456]
Temporal Graph Benchmark (TGB) is a collection of challenging and diverse benchmark datasets.
We benchmark each dataset and find that the performance of common models can vary drastically across datasets.
TGB provides an automated machine learning pipeline for reproducible and accessible temporal graph research.
arXiv Detail & Related papers (2023-07-03T13:58:20Z)
- Exploring the Practicality of Generative Retrieval on Dynamic Corpora [41.223804434693875]
In this paper, we focus on Generative Retrievals (GR), which apply autoregressive language models to IR problems.
Our results on the StreamingQA benchmark demonstrate that GR is more adaptable to evolving knowledge (4-11%), robust in learning knowledge with temporal information, and efficient in terms of FLOPs (x6), indexing time (x6), and storage footprint (x4).
Our paper highlights the potential of GR for future use in practical IR systems within dynamic environments.
arXiv Detail & Related papers (2023-05-27T16:05:00Z)
- Networked Time Series Prediction with Incomplete Data [59.45358694862176]
We propose NETS-ImpGAN, a novel deep learning framework that can be trained on incomplete data with missing values in both history and future.
We conduct extensive experiments on three real-world datasets under different missing patterns and missing rates.
arXiv Detail & Related papers (2021-10-05T18:20:42Z)