Related papers: Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models

Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models

URL: http://arxiv.org/abs/2410.23841v1
Date: Thu, 31 Oct 2024 11:47:21 GMT
Title: Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models
Authors: Jianqun Zhou, Yuanlei Zheng, Wei Chen, Qianqian Zheng, Zeyuan Shang, Wei Zhang, Rui Meng, Xiaoyu Shen,
Abstract summary: This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance. We develop a novel retrieval evaluation benchmark spanning six document-level attributes. Our findings reveal that while reranking models generally surpass retrieval models in instruction following, they still face challenges in handling certain attributes.
Score: 17.202017214385826
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Instruction-following capabilities in large language models (LLMs) have significantly progressed, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances, most of them still relies on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance, which neglects the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics -- Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE) to accurately assess the models' responsiveness to instructions. Our findings reveal that while reranking models generally surpass retrieval models in instruction following, they still face challenges in handling certain attributes. Moreover, although instruction fine-tuning and increased model size lead to better performance, most models fall short of achieving comprehensive instruction compliance as assessed by our benchmark.

Related papers

Training Plug-n-Play Knowledge Modules with Deep Context Distillation [52.94830874557649]
In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs) KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents. Our method outperforms standard next-token prediction and pre-instruction training techniques, across two datasets.
arXiv Detail & Related papers (2025-03-11T01:07:57Z)
ATEB: Evaluating and Improving Advanced NLP Tasks for Text Embedding Models [27.18321648849259]
More advanced NLP tasks require a deeper understanding of text, such as safety and factuality. We introduce a new benchmark designed to assess and highlight the limitations of embedding models trained on existing information retrieval data mixtures. We propose a novel method that reformulates these various tasks as retrieval tasks.
arXiv Detail & Related papers (2025-02-24T01:08:15Z)
Contextualizing Search Queries In-Context Learning for Conversational Rewriting with LLMs [0.0]
This paper introduces Prompt-Guided In-Context Learning, a novel approach for few-shot conversational query rewriting. Our method employs carefully designed prompts, incorporating task descriptions, input/output format specifications, and a small set of illustrative examples. Experiments on benchmark datasets, TREC and Taskmaster-1, demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2025-02-20T20:02:42Z)
RALLRec: Improving Retrieval Augmented Large Language Model Recommendation with Representation Learning [24.28601381739682]
Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. Existing RAG methods rely primarily on textual semantics and often fail to incorporate the most relevant items. We propose Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec)
arXiv Detail & Related papers (2025-02-10T02:15:12Z)
ICLERB: In-Context Learning Embedding and Reranker Benchmark [45.40331863265474]
In-Context Learning (ICL) enables Large Language Models to perform new tasks by conditioning on prompts with relevant information. Traditional retrieval methods focus on semantic relevance, treating retrieval as a search problem. We propose reframing retrieval for ICL as a recommendation problem, aiming to select documents that maximize utility in ICL tasks.
arXiv Detail & Related papers (2024-11-28T06:28:45Z)
Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA) Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents. We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
SC-Rec: Enhancing Generative Retrieval with Self-Consistent Reranking for Sequential Recommendation [18.519480704213017]
We propose SC-Rec, a unified recommender system that learns diverse preference knowledge from two distinct item indices and multiple prompt templates. SC-Rec considerably outperforms the state-of-the-art methods for sequential recommendation, effectively incorporating complementary knowledge from varied outputs of the model.
arXiv Detail & Related papers (2024-08-16T11:59:01Z)
Enhancing Retrieval-Augmented LMs with a Two-stage Consistency Learning Compressor [4.35807211471107]
This work proposes a novel two-stage consistency learning approach for retrieved information compression in retrieval-augmented language models. The proposed method is empirically validated across multiple datasets, demonstrating notable enhancements in precision and efficiency for question-answering tasks.
arXiv Detail & Related papers (2024-06-04T12:43:23Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions [71.5977045423177]
We study the use of instructions in Information Retrieval systems. We introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark. We show that it is possible for IR models to learn to follow complex instructions.
arXiv Detail & Related papers (2024-03-22T14:42:29Z)
Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction. We reformulate the task to be entity-centric, enabling the use of diverse metrics. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling. Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training. Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z)
Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant. To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: An instruction-based, benchmark and evaluation suite for automatic evaluation of editing capabilities. We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA. Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
Who Explains the Explanation? Quantitatively Assessing Feature Attribution Methods [0.0]
We propose a novel evaluation metric -- the Focus -- designed to quantify the faithfulness of explanations. We show the robustness of the metric through randomization experiments, and then use Focus to evaluate and compare three popular explainability techniques. Our results find LRP and GradCAM to be consistent and reliable, while the latter remains most competitive even when applied to poorly performing models.
arXiv Detail & Related papers (2021-09-28T07:10:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.