Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge
- URL: http://arxiv.org/abs/2506.22644v1
- Date: Fri, 27 Jun 2025 21:20:43 GMT
- Title: Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge
- Authors: Chase Fensore, Kaustubh Dhole, Joyce C. Ho, Eugene Agichtein
- Abstract summary: We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval methods. We demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 but introduces prohibitive computational costs.
- Score: 8.680958290253914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval methods and then aims to generate relevant and faithful answers with Falcon3-10B-Instruct. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we demonstrate that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (52% relative improvement) but introduces prohibitive computational costs (84s vs 1.74s per question). While DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs 0.668), their 0% refusal rates raised concerns about over-confidence and generalizability. Our submitted hybrid system without re-ranking achieved 4th place in faithfulness and 11th place in correctness among 25 teams. Analysis across question categories reveals that vocabulary alignment between questions and documents was the strongest predictor of performance on our development set, with document-similar phrasing improving cosine similarity from 0.562 to 0.762.
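The abstract names a hybrid of sparse (BM25) and dense (E5) retrieval but does not say how the two rankings are merged. Below is a minimal, self-contained sketch assuming reciprocal rank fusion (RRF), a common default for this setup; the document IDs and ranked lists are hypothetical, not the paper's data.

```python
# Minimal sketch of hybrid sparse+dense retrieval fusion.
# The paper combines BM25 and E5 rankings but does not state the fusion
# rule; reciprocal rank fusion (RRF) is assumed here for illustration.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    k: RRF smoothing constant (60 is the value from the original RRF paper).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 lists from a sparse (BM25) and a dense (E5) retriever.
bm25_hits = ["d3", "d1", "d7", "d2", "d9"]
e5_hits = ["d1", "d4", "d3", "d8", "d2"]

fused = reciprocal_rank_fusion([bm25_hits, e5_hits])
print(fused[:5])  # documents ranked by combined evidence; here d1, then d3
```

With k = 60, documents that rank high on either list dominate the fused ordering, which matches the stated goal of combining lexical and semantic evidence before any optional RankLLaMA re-ranking step.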
Related papers
- RAGentA: Multi-Agent Retrieval-Augmented Generation for Attributed Question Answering [8.846547396283832]
RAGentA is a multi-agent retrieval-augmented generation (RAG) framework for attributed question answering (QA). Central to the framework is a hybrid retrieval strategy that combines sparse and dense methods, improving Recall@20 by 12.5%. RAGentA outperforms standard RAG baselines, achieving gains of 1.09% in correctness and 10.72% in faithfulness.
arXiv Detail & Related papers (2025-06-20T13:37:03Z)
- RAGtifier: Evaluating RAG Generation Approaches of State-of-the-Art RAG Systems for the SIGIR LiveRAG Competition [0.0]
The LiveRAG 2025 challenge explores RAG solutions to maximize accuracy on DataMorgana's QA pairs. The challenge provides access to sparse OpenSearch and dense Pinecone indices of the Fineweb 10BT dataset. Our solution achieved a correctness score of 1.13 and a faithfulness score of 0.55, placing fourth in the SIGIR 2025 LiveRAG Challenge.
arXiv Detail & Related papers (2025-06-17T11:14:22Z)
- ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge [53.18163869901266]
ESGenius is a benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social and Governance (ESG) and sustainability knowledge. ESGenius comprises two key components: ESGenius-QA and ESGenius-Corpus.
arXiv Detail & Related papers (2025-06-02T13:19:09Z)
- Deep Retrieval at CheckThat! 2025: Identifying Scientific Papers from Implicit Social Media Mentions via Hybrid Retrieval and Re-Ranking [4.275139302875217]
We present the methodology and results of the Deep Retrieval team for subtask 4b of the CLEF CheckThat! 2025 competition. We propose a hybrid retrieval pipeline that combines lexical precision, semantic generalization, and deep contextual re-ranking. Our approach achieves a mean reciprocal rank at 5 (MRR@5) of 76.46% on the development set and 66.43% on the hidden test set (a reference sketch of MRR@5 and related ranking metrics appears after this list).
arXiv Detail & Related papers (2025-05-29T08:55:39Z)
- HCQA-1.5 @ Ego4D EgoSchema Challenge 2025 [77.414837862995]
We present a method that achieves third place in the Ego4D EgoSchema Challenge at CVPR 2025. Our approach introduces a multi-source aggregation strategy to generate diverse predictions, followed by a confidence-based filtering mechanism. Our method achieves 77% accuracy on over 5,000 human-curated multiple-choice questions.
arXiv Detail & Related papers (2025-05-27T02:45:14Z)
- Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval [49.1574468325115]
We introduce Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10. More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller.
arXiv Detail & Related papers (2025-05-25T23:06:20Z)
- Scalable Unit Harmonization in Medical Informatics via Bayesian-Optimized Retrieval and Transformer-Based Re-ranking [0.14504054468850663]
We develop a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets. We implement a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. The system achieved 83.39% precision at rank 1 and 94.66% recall at rank 5.
arXiv Detail & Related papers (2025-05-01T19:09:15Z)
- From Retrieval to Generation: Comparing Different Approaches [15.31883349259767]
We evaluate retrieval-based, generation-based, and hybrid models for knowledge-intensive tasks. We show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17% on NQ. We also analyze language modeling tasks using WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity compared to generative and hybrid methods.
arXiv Detail & Related papers (2025-02-27T16:29:14Z)
- Mind the Gap! Static and Interactive Evaluations of Large Audio Models [55.87220295533817]
Large Audio Models (LAMs) are designed to power voice-native experiences. This study introduces an interactive approach to evaluating LAMs and collects 7,500 LAM interactions from 484 participants.
arXiv Detail & Related papers (2025-02-21T20:29:02Z)
- Retrieval-Augmented Generation for Domain-Specific Question Answering: A Case Study on Pittsburgh and CMU [3.1787418271023404]
We designed a Retrieval-Augmented Generation (RAG) system to provide large language models with relevant documents for answering domain-specific questions.
We extracted over 1,800 subpages using a greedy scraping strategy and employed a hybrid annotation process, combining manual and Mistral-generated question-answer pairs.
Our RAG framework integrates BM25 and FAISS retrievers, enhanced with a reranker for improved document retrieval accuracy.
arXiv Detail & Related papers (2024-11-20T20:10:43Z)
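Both the main system (RankLLaMA) and the Pittsburgh/CMU entry above add a re-ranking stage on top of first-stage retrieval. The sketch below shows only the pipeline shape: a toy lexical-overlap scorer stands in for the neural cross-encoder, and the query and candidate passages are hypothetical.

```python
# Sketch of a re-ranking stage over first-stage (BM25/FAISS) candidates.
# A neural cross-encoder such as RankLLaMA is replaced by a toy
# lexical-overlap scorer so the example stays self-contained.

def overlap_score(query, doc):
    """Stand-in scorer: fraction of query terms present in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def rerank(query, candidates, top_k=3):
    """Re-score first-stage candidates and return the best top_k."""
    scored = sorted(candidates, key=lambda d: overlap_score(query, d), reverse=True)
    return scored[:top_k]

query = "when was carnegie mellon university founded"
candidates = [  # hypothetical first-stage retrieval hits
    "Carnegie Mellon University was founded in 1900 by Andrew Carnegie.",
    "Pittsburgh is a city in western Pennsylvania.",
    "The university hosts the annual Spring Carnival.",
]
print(rerank(query, candidates, top_k=2))  # founding passage ranked first
```

The trade-off the main paper quantifies (MAP 0.523 to 0.797, but 1.74s to 84s per question) comes from swapping this cheap scorer for a full neural model scoring every query-document pair.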
- ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it. We propose a guided hallucination-based approach, ELOQ, to automatically generate a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve word-error-rate as well as to increase training speed.
We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer together with improved beam search, reaches a quality only 3.8% WER absolute worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
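Several entries above are compared through rank-based retrieval metrics: MAP (main paper), Recall@20 (RAGentA), MRR@5 (Deep Retrieval), and precision/recall at fixed ranks (unit harmonization). For reference, here is a minimal, self-contained implementation of these metrics; the relevance judgments are hypothetical and do not reproduce any paper's data.

```python
# Reference implementations of the ranking metrics quoted by the papers
# above (MAP's per-query AP, MRR@5, Recall@k); illustrative only.

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision values at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / i
    return precision_sum / max(len(relevant_ids), 1)

def mrr_at_k(ranked_ids, relevant_ids, k=5):
    """Reciprocal rank of the first relevant document within the top k."""
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant documents retrieved in the top k."""
    retrieved = set(ranked_ids[:k]) & set(relevant_ids)
    return len(retrieved) / max(len(relevant_ids), 1)

# Hypothetical single-query example.
ranking = ["d4", "d1", "d9", "d2", "d7"]
relevant = {"d1", "d2"}
print(average_precision(ranking, relevant))  # (1/2 + 2/4) / 2 = 0.5
print(mrr_at_k(ranking, relevant))           # first relevant at rank 2 -> 0.5
print(recall_at_k(ranking, relevant, k=5))   # both relevant in top 5 -> 1.0
```

MAP is the mean of `average_precision` over all evaluation queries, which is how the main paper's 0.523 vs. 0.797 comparison is computed.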