Pointwise Paraphrase Appraisal is Potentially Problematic
- URL: http://arxiv.org/abs/2005.11996v2
- Date: Fri, 5 Jun 2020 03:18:28 GMT
- Title: Pointwise Paraphrase Appraisal is Potentially Problematic
- Authors: Hannah Chen, Yangfeng Ji, David Evans
- Abstract summary: We show that although the standard way of fine-tuning BERT for paraphrase identification, pairing the two sentences as one sequence, yields state-of-the-art performance, the resulting model can perform poorly on simple inputs such as pairs of identical sentences.
We also show that these models may even assign a higher paraphrase score to a pair of randomly selected sentences than to a pair of identical ones.
- Score: 21.06607915149245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prevailing approach for training and evaluating paraphrase identification
models is constructed as a binary classification problem: the model is given a
pair of sentences, and is judged by how accurately it classifies pairs as
either paraphrases or non-paraphrases. This pointwise-based evaluation method
does not match well the objective of most real world applications, so the goal
of our work is to understand how models which perform well under pointwise
evaluation may fail in practice and find better methods for evaluating
paraphrase identification models. As a first step towards that goal, we show
that although the standard way of fine-tuning BERT for paraphrase
identification by pairing two sentences as one sequence results in a model with
state-of-the-art performance, that model may perform poorly on simple tasks
like identifying pairs with two identical sentences. Moreover, we show that
these models may even predict a pair of randomly-selected sentences with higher
paraphrase score than a pair of identical ones.
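To make the pointwise setup concrete, here is a minimal sketch, not the authors' code, of the formulation the abstract critiques: a BERT cross-encoder packs both sentences into a single sequence and emits a per-pair paraphrase probability. The checkpoint name and the label index of the paraphrase class are assumptions; any BERT model fine-tuned for paraphrase identification (e.g., on QQP) can be substituted.

```python
# Minimal sketch of pointwise paraphrase scoring with a BERT cross-encoder.
# Checkpoint name and label ordering are assumptions, not the paper's code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "textattack/bert-base-uncased-QQP"  # assumed QQP fine-tune
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def paraphrase_score(s1: str, s2: str) -> float:
    """Probability that (s1, s2) is a paraphrase, with the pair packed
    into a single [CLS] s1 [SEP] s2 [SEP] sequence, as in standard BERT
    fine-tuning for sentence-pair classification."""
    inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes index 1 is the "paraphrase"/"duplicate" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()

sentence = "The quick brown fox jumps over the lazy dog."
unrelated = "Stock markets closed lower on Friday amid rate worries."

# The paper's observation: nothing in this pointwise formulation
# guarantees the identical pair outscores the unrelated pair.
print("identical pair:", paraphrase_score(sentence, sentence))
print("random pair:  ", paraphrase_score(sentence, unrelated))
```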
Related papers
- Semi-supervised Learning For Robust Speech Evaluation [30.593420641501968]
Speech evaluation measures a learner's oral proficiency using automatic models.
This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization.
An anchor model is trained using pseudo labels to predict the correctness of pronunciation.
arXiv Detail & Related papers (2024-09-23T02:11:24Z)
- PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models [5.980822697955566]
PARAPHRASUS is a benchmark designed for multi-dimensional assessment of paraphrase detection models.
We find that paraphrase detection models examined under a fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset.
arXiv Detail & Related papers (2024-09-18T15:33:48Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation? [8.51696622847778]
A distinction is often drawn between a model's ability to predict a label for an evaluation sample by direct memorisation of highly similar training samples and its ability to predict that label through generalisation.
We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers.
arXiv Detail & Related papers (2023-11-21T04:06:08Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just a few anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- Assessing Word Importance Using Models Trained for Semantic Tasks [0.0]
We derive word significance from models trained to solve semantic tasks: Natural Language Inference and Paraphrase Identification.
We evaluate their relevance using a so-called cross-task evaluation.
Our method can be used to identify important words in sentences without any explicit word-importance labeling in training.
arXiv Detail & Related papers (2023-05-31T09:34:26Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence-pair classification tasks still suffer from a common pitfall: adopting inference heuristics based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach which achieves convergence speed comparable to its pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning [30.09715149060206]
Keyphrase Generation (KG) is a classical task for capturing the central idea of a given document.
In this paper, we propose a new fine-grained evaluation metric that considers different granularities.
To capture more recessive linguistic patterns, we use a pre-trained model (e.g., BERT) to compute a continuous similarity score between predicted and target keyphrases (a toy version of this scoring appears after this list).
arXiv Detail & Related papers (2021-04-18T10:13:46Z)
- Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words" (a rough sketch of this scheme also follows the list).
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
- Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading [96.48553941812366]
Lip-reading aims to infer the speech content from the lip movement sequence.
The traditional learning process of seq2seq models suffers from two problems.
We propose a novel pseudo-convolutional policy gradient (PCPG) based method to address these two problems.
arXiv Detail & Related papers (2020-03-09T09:12:26Z)
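The continuous keyphrase similarity mentioned in the Keyphrase Generation entry above can be pictured with a toy sketch: embed predicted and target keyphrases with BERT and compare them by cosine similarity. Mean pooling over token embeddings and the bert-base-uncased checkpoint are illustrative assumptions, not the paper's exact reward design.

```python
# Toy version of BERT-based continuous keyphrase similarity.
# Mean pooling and the checkpoint choice are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(phrase: str) -> torch.Tensor:
    """Mean-pooled BERT token embeddings for a keyphrase."""
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def keyphrase_similarity(predicted: str, target: str) -> float:
    """Continuous similarity in [-1, 1] between two keyphrases."""
    return torch.cosine_similarity(embed(predicted), embed(target), dim=0).item()

print(keyphrase_similarity("neural ranking", "neural network ranking"))
print(keyphrase_similarity("neural ranking", "protein folding"))
```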
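Similarly, the "relevance labels as target words" idea from the Document Ranking entry can be sketched as follows. The prompt template, the true/false target words, and the use of an off-the-shelf t5-base checkpoint are assumptions for illustration; the published approach fine-tunes T5 on relevance-labeled data before scoring pairs this way.

```python
# Rough sketch of seq2seq relevance scoring via target-word probabilities.
# Prompt template and vanilla t5-base checkpoint are illustrative assumptions.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.eval()

def relevance_score(query: str, document: str) -> float:
    """Score a query-document pair by the model's relative probability of
    generating the target word 'true' versus 'false' as the first token."""
    text = f"Query: {query} Document: {document} Relevant:"
    input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(input_ids=input_ids,
                       decoder_input_ids=decoder_input_ids).logits[0, -1]
    true_id = tokenizer.encode("true")[0]
    false_id = tokenizer.encode("false")[0]
    # Softmax restricted to the two candidate target words.
    pair = torch.softmax(logits[[true_id, false_id]], dim=-1)
    return pair[0].item()

print(relevance_score("what is a paraphrase",
                      "A paraphrase restates a text in different words."))
```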
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.