Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS
- URL: http://arxiv.org/abs/2306.05083v2
- Date: Tue, 13 Jun 2023 04:56:06 GMT
- Title: Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS
- Authors: Cheng-Han Chiang, Yung-Sung Chuang, James Glass, Hung-yi Lee
- Abstract summary: It is unclear what kind of sentence pairs a sentence encoder (SE) would consider similar.
HEROS is constructed by transforming an original sentence into a new sentence based on certain rules to form a minimal pair.
By systematically comparing the performance of over 60 supervised and unsupervised SEs on HEROS, we reveal that most unsupervised sentence encoders are insensitive to negation.
- Score: 68.34155010428941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing sentence textual similarity benchmark datasets only use a single
number to summarize how similar the sentence encoder's decision is to humans'.
However, it is unclear what kind of sentence pairs a sentence encoder (SE)
would consider similar. Moreover, existing SE benchmarks mainly consider
sentence pairs with low lexical overlap, so it is unclear how the SEs behave
when two sentences have high lexical overlap. We introduce a high-quality SE
diagnostic dataset, HEROS. HEROS is constructed by transforming an original
sentence into a new sentence based on certain rules to form a \textit{minimal
pair}, and the two sentences of a minimal pair have high lexical overlap. The rules include
replacing a word with a synonym, an antonym, a typo, a random word, and
converting the original sentence into its negation. Different rules yield
different subsets of HEROS. By systematically comparing the performance of over
60 supervised and unsupervised SEs on HEROS, we reveal that most unsupervised
sentence encoders are insensitive to negation. We find the datasets used to
train the SE are the main determinants of what kind of sentence pairs an SE
considers similar. We also show that even if two SEs have similar performance
on STS benchmarks, they can have very different behavior on HEROS. Our result
reveals the blind spot of traditional STS benchmarks when evaluating SEs.
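The rule-based construction above can be made concrete with a short sketch. The lookup tables, the typo routine, and the negation template below are hypothetical stand-ins for the curated resources behind HEROS, not the authors' pipeline.

```python
import random

# Toy lookup tables; hypothetical stand-ins for the curated
# resources behind the actual HEROS rules.
SYNONYMS = {"happy": "glad", "big": "large"}
ANTONYMS = {"happy": "sad", "big": "small"}
RANDOM_WORDS = ["apple", "orbit", "velvet"]

def swap_adjacent(word: str) -> str:
    """Simulate a typo by swapping two adjacent characters."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def minimal_pair(sentence: str, rule: str) -> str:
    """Apply one transformation rule so the new sentence differs from
    the original by at most one token, keeping lexical overlap high."""
    if rule == "negation":
        return "it is not the case that " + sentence
    words = sentence.split()
    for i, w in enumerate(words):
        if rule == "synonym" and w in SYNONYMS:
            words[i] = SYNONYMS[w]
            break
        if rule == "antonym" and w in ANTONYMS:
            words[i] = ANTONYMS[w]
            break
        if rule == "typo" and len(w) > 3:
            words[i] = swap_adjacent(w)
            break
        if rule == "random" and len(w) > 3:
            words[i] = random.choice(RANDOM_WORDS)
            break
    return " ".join(words)

original = "the big dog looks happy"
for rule in ["synonym", "antonym", "typo", "random", "negation"]:
    print(f"{rule:8s} -> {minimal_pair(original, rule)}")
```

Each rule yields a different subset of the benchmark, so an encoder's similarity scores on each subset isolate one kind of semantic (or surface) perturbation.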
Related papers
- Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic
Representations [102.05351905494277]
The sub-sentence encoder is a contrastively learned contextual embedding model for fine-grained semantic representation of text.
We show that sub-sentence encoders keep the same level of inference cost and space complexity as sentence encoders.
arXiv Detail & Related papers (2023-11-07T20:38:30Z)
- The Daunting Dilemma with Sentence Encoders: Success on Standard Benchmarks, Failure in Capturing Basic Semantic Properties [6.747934699209743]
We evaluate five existing popular sentence encoders, i.e., Sentence-BERT, Universal Sentence Encoder (USE), LASER, InferSent, and Doc2vec.
We propose four semantic evaluation criteria, i.e., Paraphrasing, Synonym Replacement, Antonym Replacement, and Sentence Jumbling.
We find that the Sentence-BERT and USE models pass the paraphrasing criterion, with Sentence-BERT being the better of the two. A toy version of these checks is sketched after this entry.
arXiv Detail & Related papers (2023-09-07T14:42:35Z)
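The four criteria from the entry above can be probed with a few lines of sentence-transformers code. The model name and the toy pairs below are illustrative assumptions, not the paper's protocol; the point is only that cosine similarity should drop for antonym replacement and jumbling but stay high for paraphrases.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder choice; the paper compares five encoders.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy pairs, one per criterion; not the paper's test data.
pairs = {
    "paraphrasing":        ("the cat sat on the mat", "a cat was sitting on the mat"),
    "synonym replacement": ("the movie was great", "the movie was excellent"),
    "antonym replacement": ("the movie was great", "the movie was terrible"),
    "sentence jumbling":   ("the dog chased the cat", "the cat chased the dog"),
}

for criterion, (a, b) in pairs.items():
    emb = model.encode([a, b], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{criterion:20s} cosine = {score:.3f}")
```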
- RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank [54.854714257687334]
We propose a novel approach, RankCSE, for unsupervised sentence representation learning.
It incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework.
An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks. A generic sketch of the ranking-consistency idea follows this entry.
arXiv Detail & Related papers (2023-05-26T08:27:07Z)
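The ranking-consistency component of RankCSE can be illustrated with a listwise agreement loss between two encoder views. The sketch below, a symmetric KL over temperature-scaled similarity lists, is a generic rendering of the idea and not RankCSE's published objective; the temperature and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def ranking_consistency_loss(sim_a: torch.Tensor, sim_b: torch.Tensor,
                             tau: float = 0.05) -> torch.Tensor:
    """Encourage two views to rank candidates the same way: softmax each
    similarity list into a distribution and take a symmetric KL."""
    log_p = F.log_softmax(sim_a / tau, dim=-1)
    log_q = F.log_softmax(sim_b / tau, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Similarities of 4 queries against 16 candidates from two encoder views
# (e.g., different dropout masks); shapes are made up for the example.
sim_a = torch.randn(4, 16)
sim_b = sim_a + 0.1 * torch.randn(4, 16)
print(ranking_consistency_loss(sim_a, sim_b))
```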
- SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning [14.028140579482688]
As reported, SimCSE's simple dropout-based augmentation surprisingly dominates discrete augmentations such as cropping, word deletion, and synonym replacement.
We develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation.
Results consistently support the superiority of the proposed methods. Toy versions of the three schemes are sketched after this entry.
arXiv Detail & Related papers (2022-10-08T08:07:47Z)
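A minimal sketch of the three schemes follows, assuming toy templates and a random insertion point; the paper's actual augmentation rules may differ in detail.

```python
import random

def punctuation_insertion(sentence: str) -> str:
    """Insert a comma at a random word boundary."""
    words = sentence.split()
    i = random.randrange(1, len(words))
    return " ".join(words[:i]) + ", " + " ".join(words[i:])

def modal_verb(sentence: str) -> str:
    """Soften the sentence with a modal construction (toy template)."""
    return "it may be that " + sentence

def double_negation(sentence: str) -> str:
    """Wrap the sentence in a meaning-preserving double negation."""
    return "it is not true that it is not true that " + sentence

s = "the results support the proposed method"
for aug in (punctuation_insertion, modal_verb, double_negation):
    print(f"{aug.__name__}: {aug(s)}")
```

Each scheme perturbs the surface form while leaving the core semantics intact, which is what makes the augmented pair usable as a positive example in contrastive learning.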
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, where expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- Ranking-Enhanced Unsupervised Sentence Representation Learning [32.89057204258891]
We show that the semantic meaning of a sentence is determined by nearest-neighbor sentences that are similar to the input sentence.
We propose a novel unsupervised sentence encoder, RankEncoder. A rough sketch of the neighbor-based view follows this entry.
arXiv Detail & Related papers (2022-09-09T14:45:16Z)
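The neighbor-based view can be sketched by representing a sentence through its similarities to a set of anchor sentences, so that nearest neighbors shape the representation. The random embeddings and dimensions below are placeholders, not RankEncoder's actual procedure.

```python
import numpy as np

def neighbor_rank_representation(query: np.ndarray,
                                 corpus: np.ndarray) -> np.ndarray:
    """Represent a sentence by its cosine similarities to anchor
    sentences; neighbors with high similarity dominate the vector."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return c @ q  # one similarity score per anchor sentence

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 384))  # 100 placeholder anchor embeddings
query = rng.normal(size=384)          # placeholder sentence embedding
print(neighbor_rank_representation(query, corpus).shape)  # (100,)
```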
- A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings [28.046786376565123]
We propose a semantics-aware contrastive learning framework for sentence embeddings, termed Pseudo-Token BERT (PT-BERT).
We exploit the pseudo-token space (i.e., latent semantic space) representation of a sentence while eliminating the impact of superficial features such as sentence length and syntax.
Our model outperforms the state-of-the-art baselines on six standard semantic textual similarity (STS) tasks.
arXiv Detail & Related papers (2022-03-11T12:29:22Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlap frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the proposed neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts. A simplified sketch of the mask-and-predict step follows this entry.
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
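The mask-and-predict idea can be sketched with a masked language model from Hugging Face transformers. The model choice, the single shared word, and the plain KL comparison below are simplifying assumptions; the paper's NDD computation aggregates over the words of the longest common sequence.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def masked_distribution(sentence: str, target: str) -> torch.Tensor:
    """Mask `target` in `sentence` and return the MLM's predicted
    distribution at the masked position."""
    masked = sentence.replace(target, tok.mask_token, 1)
    inputs = tok(masked, return_tensors="pt")
    pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, pos]
    return torch.softmax(logits, dim=-1)

# Predicted distributions for a shared word in two overlapping sentences;
# their divergence reflects how much the surrounding contexts differ.
p = masked_distribution("the movie was great fun", "movie")
q = masked_distribution("the movie was a disaster", "movie")
kl = torch.sum(p * (torch.log(p) - torch.log(q)))
print(f"KL at the shared position: {kl.item():.3f}")
```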
- Three Sentences Are All You Need: Local Path Enhanced Document Relation Extraction [54.95848026576076]
We present an embarrassingly simple but effective method to select evidence sentences for document-level RE.
We have released our code at https://github.com/AndrewZhe/Three-Sentences-Are-All-You-Need.
arXiv Detail & Related papers (2021-06-03T12:29:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.