Reproducing NevIR: Negation in Neural Information Retrieval
- URL: http://arxiv.org/abs/2502.13506v3
- Date: Thu, 01 May 2025 07:27:34 GMT
- Title: Reproducing NevIR: Negation in Neural Information Retrieval
- Authors: Coen van den Elsen, Francien Barkhof, Thijmen Nijdam, Simon Lupart, Mohammad Aliannejadi,
- Abstract summary: Negation is a fundamental aspect of human communication, yet it remains a challenge for Language Models in Information Retrieval (IR). We reproduce and extend the findings of NevIR, a benchmark study that revealed most IR models perform at or below the level of random ranking when dealing with negation. Our findings show that a recently emerging category, listwise Large Language Model (LLM) re-rankers, outperforms other models but still falls short of human performance.
- Score: 5.950812862331131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Negation is a fundamental aspect of human communication, yet it remains a challenge for Language Models (LMs) in Information Retrieval (IR). Despite the heavy reliance of modern neural IR systems on LMs, little attention has been given to their handling of negation. In this study, we reproduce and extend the findings of NevIR, a benchmark study that revealed most IR models perform at or below the level of random ranking when dealing with negation. We replicate NevIR's original experiments and evaluate newly developed state-of-the-art IR models. Our findings show that a recently emerging category, listwise Large Language Model (LLM) re-rankers, outperforms other models but still falls short of human performance. Additionally, we leverage ExcluIR, a benchmark dataset designed for exclusionary queries with extensive negation, to assess the generalisability of negation understanding. Our findings suggest that fine-tuning on one dataset does not reliably improve performance on the other, indicating notable differences in their data distributions. Furthermore, we observe that only cross-encoders and listwise LLM re-rankers achieve reasonable performance across both negation tasks.
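For orientation, here is a minimal sketch of the pairwise evaluation NevIR describes, assuming its published setup: each instance pairs two queries that differ only by a negation with two contrastive documents, and a model is credited only when it ranks both correctly. The `score` function is a hypothetical stand-in for any retrieval model, not the official benchmark code.

```python
# Minimal sketch of NevIR-style pairwise evaluation; score() is a
# hypothetical stand-in for any relevance model.

def pairwise_accuracy(pairs, score):
    """pairs: iterable of (q1, q2, d1, d2) where the queries differ
    only by a negation, d1 answers q1, and d2 answers q2."""
    correct = 0
    for q1, q2, d1, d2 in pairs:
        # Credit the model only if BOTH rankings track the negation:
        # d1 above d2 for q1, and d2 above d1 for q2.
        if score(q1, d1) > score(q1, d2) and score(q2, d2) > score(q2, d1):
            correct += 1
    return correct / len(pairs)
```

Because each instance requires two independent rankings to be correct, a model that ignores negation entirely lands near 25% pairwise accuracy, which is why "at or below random" is the telling failure mode.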
Related papers
- Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models [25.346800371132645]
Retrieval-augmented Large Language Models (RALMs) do not consistently outperform the original retrieval-free Language Models (LMs).
Our experiments reveal that this example-level performance inconsistency exists not only between retrieval-augmented and retrieval-free LMs but also among different retrievers.
We introduce Ensemble of Retrievers (EoR), a trainable framework that can adaptively retrieve from different knowledge sources and effectively decrease unpredictable reader errors.
arXiv Detail & Related papers (2024-05-31T08:22:49Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
- Evaluating Machine Learning Models with NERO: Non-Equivariance Revealed on Orbits [19.45052971156096]
We propose a novel evaluation workflow, named Non-Equivariance Revealed on Orbits (NERO) Evaluation.
NERO evaluation consists of a task-agnostic interactive interface and a set of visualizations, called NERO plots.
We present case studies showing how NERO evaluation can be applied to multiple research areas, including 2D digit recognition, object detection, particle image velocimetry (PIV), and 3D point cloud classification.
arXiv Detail & Related papers (2023-05-31T14:24:35Z)
- NevIR: Negation in Neural Information Retrieval [45.9442701147499]
Negation is a common everyday phenomenon and has been a consistent area of weakness for language models (LMs).
We construct a benchmark asking IR models to rank two documents that differ only by negation.
We show that the results vary widely according to the type of IR architecture: cross-encoders perform best, followed by late-interaction models, and in last place are bi-encoder and sparse neural architectures.
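To make these architecture categories concrete, here is a hedged sketch contrasting bi-encoder and cross-encoder scoring with the sentence-transformers library; the checkpoints are common public models chosen for illustration, not necessarily those evaluated in the paper.

```python
# Sketch: why cross-encoders have an edge on negation (assumed
# illustration; model choices are examples only).
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "a city that is not a capital"
docs = ["Rotterdam is not the capital of the Netherlands.",
        "Amsterdam is the capital of the Netherlands."]

# Bi-encoder: query and documents are embedded independently, so a
# single "not" barely moves the fixed query and document vectors.
bi = SentenceTransformer("all-MiniLM-L6-v2")
print("bi-encoder:", util.cos_sim(bi.encode(query), bi.encode(docs)))

# Cross-encoder: query and document are read jointly, letting attention
# connect the query's negation to each document's wording.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("cross-encoder:", ce.predict([(query, d) for d in docs]))
```

The design difference matters for negation: a bi-encoder must fold "not" into a single query vector before any document is seen, while a cross-encoder can attend to it token by token.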
arXiv Detail & Related papers (2023-05-12T17:05:54Z)
- Improving negation detection with negation-focused pre-training [58.32362243122714]
Negation is a common linguistic feature that is crucial in many language understanding tasks.
Recent work has shown that state-of-the-art NLP models underperform on samples containing negation.
We propose a new negation-focused pre-training strategy, involving targeted data augmentation and negation masking.
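As one plausible reading of negation masking (an assumption; the paper's exact recipe may differ), the sketch below preferentially masks negation cues during MLM-style pre-training instead of sampling mask positions uniformly at random.

```python
import random

# Common negation cues; this cue list and the masking policy are
# illustrative assumptions, not the paper's published recipe.
NEGATION_CUES = {"not", "no", "never", "none", "nor", "without", "n't"}

def negation_mask(tokens, mask_token="[MASK]", base_rate=0.15):
    """Mask every negation cue, plus ~15% of the remaining tokens,
    loosely mimicking a negation-focused MLM objective."""
    return [
        mask_token
        if tok.lower() in NEGATION_CUES or random.random() < base_rate
        else tok
        for tok in tokens
    ]

print(negation_mask("the drug did not cause any rash".split()))
```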
arXiv Detail & Related papers (2022-05-09T02:41:11Z)
- Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval [51.53892300802014]
We show that supervised neural information retrieval models are prone to learning sparse attention patterns over passage tokens.
Using a novel targeted synthetic data generation method, we teach neural IR to attend more uniformly and robustly to all entities in a given passage.
arXiv Detail & Related papers (2022-04-24T22:36:48Z)
- NADE: A Benchmark for Robust Adverse Drug Events Extraction in Face of Negations [8.380439657099906]
Adverse Drug Event (ADE) extraction models can rapidly examine large collections of social media texts, detecting mentions of drug-related adverse reactions and triggering medical investigations.
Despite the recent advances in NLP, it is currently unknown whether such models are robust in the face of negation, which is pervasive across language varieties.
In this paper we evaluate three state-of-the-art systems, showing their fragility against negation, and then we introduce two possible strategies to increase the robustness of these models.
arXiv Detail & Related papers (2021-09-21T10:33:29Z)
- Adversarial Filters of Dataset Biases [96.090959788952]
Large neural models have demonstrated human-level performance on language and vision benchmarks.
Their performance degrades considerably on adversarial or out-of-distribution samples.
We propose AFLite, which adversarially filters such dataset biases.
arXiv Detail & Related papers (2020-02-10T21:59:21Z)