Evaluating Embedding Models and Pipeline Optimization for AI Search Quality
- URL: http://arxiv.org/abs/2511.22240v1
- Date: Thu, 27 Nov 2025 09:09:39 GMT
- Title: Evaluating Embedding Models and Pipeline Optimization for AI Search Quality
- Authors: Philip Zhong, Kent Chen, Don Wang
- Abstract summary: We evaluate the performance of various text embedding models and pipeline configurations for AI-driven search systems. A custom evaluation dataset of 11,975 query-chunk pairs was synthesized from US City Council meeting transcripts.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We evaluate the performance of various text embedding models and pipeline configurations for AI-driven search systems. We compare sentence-transformer and generative embedding models (e.g., All-MPNet, BGE, GTE, and Qwen) at different dimensions, indexing methods (Milvus HNSW/IVF), and chunking strategies. A custom evaluation dataset of 11,975 query-chunk pairs was synthesized from US City Council meeting transcripts using a local large language model (LLM). The data pipeline includes preprocessing, automated question generation per chunk, manual validation, and continuous integration/continuous deployment (CI/CD) integration. We measure retrieval accuracy using reference-based metrics: Top-K Accuracy and Normalized Discounted Cumulative Gain (NDCG). Our results demonstrate that higher-dimensional embeddings significantly boost search quality (e.g., Qwen3-Embedding-8B/4096 achieves a Top-3 accuracy of about 0.571 versus 0.412 for GTE-large/1024), and that neural re-rankers (e.g., a BGE cross-encoder) further improve ranking accuracy (Top-3 up to 0.527). Finer-grained chunking (512 characters versus 2000 characters) also improves accuracy. We discuss the impact of these factors and outline future directions for pipeline automation and evaluation.
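The reference-based metrics named in the abstract can be sketched in a few lines. The following is a minimal illustration assuming binary relevance (exactly one gold chunk per query), which makes the ideal DCG equal to 1; the function names, the fixed-size character chunker, and the toy data are illustrative, not taken from the paper's code.

```python
import math

def chunk_text(text, size=512):
    """Split a document into fixed-size character chunks
    (the paper compares 512- vs 2000-character chunks)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_k_accuracy(ranked_ids, relevant_id, k=3):
    """1.0 if the gold chunk appears among the top-k retrieved ids, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=3):
    """Binary-relevance NDCG: gain 1 at the gold chunk's rank,
    discounted by log2(rank + 1); ideal DCG is 1 (gold at rank 1)."""
    for rank, cid in enumerate(ranked_ids[:k], start=1):
        if cid == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Toy evaluation: each query maps to the ranked chunk ids a retriever returned.
results = {"q1": ["c2", "c7", "c1"], "q2": ["c5", "c3", "c9"]}
gold = {"q1": "c7", "q2": "c9"}
top3 = sum(top_k_accuracy(results[q], gold[q]) for q in gold) / len(gold)
ndcg3 = sum(ndcg_at_k(results[q], gold[q]) for q in gold) / len(gold)
```

With one gold chunk per query, Top-K accuracy equals hit rate at K, and averaging these per-query scores over the 11,975 query-chunk pairs would yield the aggregate figures the paper reports.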
Related papers
- AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators [57.003100107659684]
AutoMetrics is a framework for synthesizing evaluation metrics under low-data constraints.
We show that AutoMetrics can be used as a proxy reward to the same effect as a verifiable reward.
arXiv Detail & Related papers (2025-12-19T06:32:46Z) - Detect Anything via Next Point Prediction [51.55967987350882]
Rex-Omni is a 3B-scale MLLM that achieves state-of-the-art object perception performance.
On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models.
arXiv Detail & Related papers (2025-10-14T17:59:54Z) - Open-Source Agentic Hybrid RAG Framework for Scientific Literature Review [2.092154729589438]
We present an agentic approach that encapsulates the hybrid RAG pipeline within an autonomous agent.
Our pipeline ingests bibliometric open-access data from PubMed, arXiv, and Google Scholar APIs.
A Llama-3.3-70B agent selects GraphRAG (translating queries to Cypher for the KG) or VectorRAG (combining sparse and dense retrieval with re-ranking).
arXiv Detail & Related papers (2025-07-30T18:54:15Z) - Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models [0.6827423171182154]
Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains.
We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance.
Our empirical analysis reveals critical insights: smaller chunks (less than 10 tokens) improve precision by 31-42%.
arXiv Detail & Related papers (2025-02-21T06:38:57Z) - FlowTS: Time Series Generation via Rectified Flow [67.41208519939626]
FlowTS is an ODE-based model that leverages rectified flow with straight-line transport in probability space.
In the unconditional setting, FlowTS achieves state-of-the-art performance, with context FID scores of 0.019 and 0.011 on the Stock and ETTh datasets.
In the conditional setting, it achieves superior performance in solar forecasting.
arXiv Detail & Related papers (2024-11-12T03:03:23Z) - OPUS: Occupancy Prediction Using a Sparse Set [64.60854562502523]
We present a framework to simultaneously predict occupied locations and classes using a set of learnable queries.
OPUS incorporates a suite of non-trivial strategies to enhance model performance.
Our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at near 2x FPS, while our heaviest model surpasses previous best results by 6.1 RayIoU.
arXiv Detail & Related papers (2024-09-14T07:44:22Z) - Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG [1.8448587047759064]
This paper benchmarks various publicly available ranking models and examines their impact on ranking accuracy.
We focus on text retrieval for question-answering tasks, a common use case for Retrieval-Augmented Generation systems.
We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3, which achieves a significant accuracy increase of 14% compared to pipelines with other rerankers.
arXiv Detail & Related papers (2024-09-12T01:51:06Z) - NUDGE: Lightweight Non-Parametric Fine-Tuning of Embeddings for Retrieval [0.7646713951724011]
Existing approaches either fine-tune the pre-trained model itself or, more efficiently, train adaptor models to transform the output of the pre-trained model.
We present NUDGE, a family of novel non-parametric embedding fine-tuning approaches.
NUDGE directly modifies the embeddings of data records to maximize the accuracy of $k$-NN retrieval.
arXiv Detail & Related papers (2024-09-04T00:10:36Z) - Approximation-Aware Bayesian Optimization [34.56666383247348]
High-dimensional Bayesian optimization (BO) tasks often require 10,000 function evaluations before obtaining meaningful results.<n>We modify sparse variational Gaussian processes (SVGPs) to better align with the goals of BO.<n>Using the framework of utility-calibrated variational inference, we unify GP approximation and data acquisition into a joint optimization problem.
arXiv Detail & Related papers (2024-06-06T17:55:02Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.