Evaluating D-MERIT of Partial-annotation on Information Retrieval
- URL: http://arxiv.org/abs/2406.16048v2
- Date: Sun, 13 Oct 2024 15:30:32 GMT
- Title: Evaluating D-MERIT of Partial-annotation on Information Retrieval
- Authors: Royi Rassin, Yaron Fairstein, Oren Kalinsky, Guy Kushilevitz, Nachshon Cohen, Alexander Libov, Yoav Goldberg
- Abstract summary: Retrieval models are often evaluated on partially-annotated datasets.
We show that using partially-annotated datasets in evaluation can paint a distorted picture.
- Score: 77.44452769932676
- Abstract: Retrieval models are often evaluated on partially-annotated datasets. Each query is mapped to a few relevant texts and the remaining corpus is assumed to be irrelevant. As a result, models that successfully retrieve false negatives are punished in evaluation. Unfortunately, completely annotating all texts for every query is not resource-efficient. In this work, we show that using partially-annotated datasets in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to contain all relevant passages for each query. Queries describe a group (e.g., "journals about linguistics") and relevant passages are evidence that entities belong to the group (e.g., a passage indicating that "Language" is a journal about linguistics). We show that evaluating on a dataset containing annotations for only a subset of the relevant passages might result in a misleading ranking of the retrieval systems and that, as more relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as a resource for evaluation and our study as a recommendation for balancing resource-efficiency and reliable evaluation when annotating evaluation sets for text retrieval.
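To make the false-negative problem concrete, below is a minimal Python sketch (not the paper's code and not the D-MERIT data; the query, passage IDs, and system rankings are invented for illustration) showing how a partial set of relevance judgments can flip the ranking of two hypothetical retrieval systems under recall@5.

```python
# Minimal sketch of how partial annotation can distort system rankings:
# a system that retrieves relevant-but-unjudged passages is penalized,
# because those passages count as non-relevant under the partial qrels.

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant passages found in the top-k of a ranking."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical query: "journals about linguistics".
# Full (ideal) annotation: five relevant passages.
full_qrels = {"p1", "p2", "p3", "p4", "p5"}
# Partial annotation: only two of them were judged relevant.
partial_qrels = {"p1", "p2"}

# Top-5 rankings returned by two hypothetical retrieval systems.
system_a = ["p3", "p4", "p5", "x1", "x2"]  # finds relevant but unjudged passages
system_b = ["p1", "x3", "x4", "x5", "x6"]  # finds one judged passage, little else

for name, run in [("A", system_a), ("B", system_b)]:
    print(
        f"system {name}: "
        f"recall@5 (partial) = {recall_at_k(run, partial_qrels):.2f}, "
        f"recall@5 (full) = {recall_at_k(run, full_qrels):.2f}"
    )

# With partial qrels, system B looks better (0.50 vs. 0.00); with the fuller
# annotation the ordering flips (A: 0.60 vs. B: 0.20), illustrating the
# misleading rankings the abstract describes.
```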
Related papers
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z) - Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods [42.08097583183816]
We describe a new dataset, the podcast summary assessment corpus.
This dataset has two unique aspects: (i) long-input, speech-based podcast documents; and (ii) an opportunity to detect inappropriate reference summaries in the podcast corpus.
arXiv Detail & Related papers (2022-08-28T18:24:41Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z) - Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization [6.826626737986031]
We propose evaluation metrics that reflect semantic meanings of a reference summary and the original document.
We then propose a method for improving the correlation of the metrics with human judgment.
arXiv Detail & Related papers (2020-04-29T08:26:30Z) - Extending Text Informativeness Measures to Passage Interestingness Evaluation (Language Model vs. Word Embedding) [1.2998637003026272]
This paper defines the concept of Interestingness as a generalization of Informativeness.
We then study the ability of state-of-the-art Informativeness measures to cope with this generalization.
We prove that the CLEF-INEX Tweet Contextualization 2012 Logarithm Similarity measure provides the best results.
arXiv Detail & Related papers (2020-04-14T18:22:48Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)