PARADE: A New Dataset for Paraphrase Identification Requiring Computer
Science Domain Knowledge
- URL: http://arxiv.org/abs/2010.03725v1
- Date: Thu, 8 Oct 2020 02:01:31 GMT
- Title: PARADE: A New Dataset for Paraphrase Identification Requiring Computer
Science Domain Knowledge
- Authors: Yun He, Zhuoer Wang, Yin Zhang, Ruihong Huang and James Caverlee
- Abstract summary: PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge.
Experiments show that both state-of-the-art neural models and non-expert human annotators have poor performance on PARADE.
- Score: 35.66853329610162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new benchmark dataset called PARADE for paraphrase
identification that requires specialized domain knowledge. PARADE contains
paraphrases that overlap very little at the lexical and syntactic level but are
semantically equivalent based on computer science domain knowledge, as well as
non-paraphrases that overlap greatly at the lexical and syntactic level but are
not semantically equivalent based on this domain knowledge. Experiments show
that both state-of-the-art neural models and non-expert human annotators have
poor performance on PARADE. For example, BERT after fine-tuning achieves an F1
score of 0.709, which is much lower than its performance on other paraphrase
identification datasets. PARADE can serve as a resource for researchers
interested in testing models that incorporate domain knowledge. We make our
data and code freely available.
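The BERT baseline quoted above treats paraphrase identification as standard sentence-pair classification. A minimal sketch of that setup follows; the checkpoint name and the example pair are illustrative assumptions, not taken from the paper:

```python
# Minimal sentence-pair classification setup, as used for paraphrase
# identification baselines like the fine-tuned BERT result quoted above.
# The checkpoint and the example pair are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = non-paraphrase, 1 = paraphrase
)

# Two definitions that share domain meaning but little surface form.
s1 = "A hash table resolves collisions by chaining entries in buckets."
s2 = "Colliding keys are stored in per-bucket linked lists."

inputs = tokenizer(s1, s2, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (head untrained here)
```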
Related papers
- SememeASR: Boosting Performance of End-to-End Speech Recognition against
Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge into speech recognition.
Our experiments show that sememe information improves recognition effectiveness.
Further experiments show that sememe knowledge also improves the model's recognition of long-tailed data.
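The summary does not specify how sememe knowledge enters the model. Purely as a hedged illustration, one simple fusion scheme pools each token's sememe embeddings and adds them to the token embedding; the module and all dimensions below are assumptions, not the paper's architecture:

```python
# Hypothetical illustration of injecting sememe knowledge into token
# representations: average each token's sememe embeddings and add the
# result to the token embedding. NOT the paper's exact mechanism.
import torch
import torch.nn as nn

class SememeFusion(nn.Module):
    def __init__(self, vocab_size, n_sememes, dim):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.sem_emb = nn.Embedding(n_sememes, dim, padding_idx=0)

    def forward(self, token_ids, sememe_ids):
        # token_ids: (batch, seq); sememe_ids: (batch, seq, k), 0 = pad
        sem = self.sem_emb(sememe_ids).mean(dim=2)  # pool k sememes per token
        return self.tok_emb(token_ids) + sem        # fused representation

fused = SememeFusion(vocab_size=1000, n_sememes=200, dim=64)
tokens = torch.randint(1, 1000, (2, 5))
sememes = torch.randint(0, 200, (2, 5, 3))
print(fused(tokens, sememes).shape)  # torch.Size([2, 5, 64])
```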
arXiv Detail & Related papers (2023-09-04T08:35:05Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics
Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may serve as useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Hierarchical Transformer Model for Scientific Named Entity Recognition [0.20646127669654832]
We present a simple and effective approach for Named Entity Recognition.
The main idea of our approach is to encode the input subword sequence with a pre-trained transformer such as BERT.
We evaluate our approach on three benchmark datasets for scientific NER.
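A minimal sketch of the described backbone, where a pre-trained transformer encodes the subword sequence and a linear head tags each subword; the checkpoint and the toy label set are assumptions:

```python
# Sketch: encode the subword sequence with a pre-trained transformer and
# classify each subword into entity tags. Checkpoint and labels assumed.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-Method", "I-Method", "B-Task", "I-Task"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

inputs = tokenizer("We fine-tune BERT for scientific NER.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits     # (1, seq_len, num_labels)
pred = logits.argmax(dim=-1)[0]
print([labels[int(i)] for i in pred])   # per-subword tag predictions
```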
arXiv Detail & Related papers (2022-03-28T12:59:06Z)
- Can BERT Dig It? -- Named Entity Recognition for Information Retrieval
in the Archaeology Domain [3.928604516640069]
ArcheoBERTje is a BERT model pre-trained on Dutch archaeological texts.
We analyse the differences between the vocabulary and output of the BERT models on the full collection.
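A vocabulary comparison of the kind described can be sketched by intersecting two tokenizers' vocabularies. The checkpoints below (BERTje and multilingual BERT) are stand-ins, not necessarily the paper's exact models:

```python
# Sketch of a vocabulary comparison between two BERT models, in the
# spirit of the analysis above. Checkpoints are stand-ins (assumptions).
from transformers import AutoTokenizer

tok_a = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
tok_b = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

vocab_a = set(tok_a.get_vocab())
vocab_b = set(tok_b.get_vocab())

shared = vocab_a & vocab_b
print(f"|A|={len(vocab_a)}  |B|={len(vocab_b)}  shared={len(shared)}")
print(sorted(vocab_a - vocab_b)[:10])  # tokens unique to the first model
```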
arXiv Detail & Related papers (2021-06-14T20:26:19Z)
- A Novel Deep Learning Method for Textual Sentiment Analysis [3.0711362702464675]
This paper proposes a convolutional neural network integrated with a hierarchical attention layer to extract informative words.
The proposed model achieves higher classification accuracy than comparable baselines and can extract informative words.
Applying incremental transfer learning can significantly enhance the classification performance.
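A hedged sketch of such a convolutional classifier with an attention layer whose weights surface informative words; the single attention level and all dimensions are simplifying assumptions rather than the paper's exact design:

```python
# Sketch: CNN sentiment classifier with attention pooling; the attention
# weights indicate which words were informative. Dimensions assumed.
import torch
import torch.nn as nn

class AttnCNN(nn.Module):
    def __init__(self, vocab_size, dim=64, n_filters=32, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.conv = nn.Conv1d(dim, n_filters, kernel_size=3, padding=1)
        self.attn = nn.Linear(n_filters, 1)
        self.out = nn.Linear(n_filters, n_classes)

    def forward(self, x):                               # x: (batch, seq)
        h = self.conv(self.emb(x).transpose(1, 2)).relu()  # (b, f, seq)
        h = h.transpose(1, 2)                           # (b, seq, f)
        w = torch.softmax(self.attn(h), dim=1)          # per-word weights
        ctx = (w * h).sum(dim=1)                        # attention pooling
        return self.out(ctx), w.squeeze(-1)             # logits, importances

model = AttnCNN(vocab_size=5000)
logits, weights = model(torch.randint(0, 5000, (2, 12)))
print(logits.shape, weights.shape)  # (2, 2) and (2, 12)
```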
arXiv Detail & Related papers (2021-02-23T12:11:36Z)
- R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic
Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
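A schematic of the global/local split described above: BERT's [CLS] state gives the global view, a CNN over the token states gives the local one. The self-supervised relation-of-relation task is omitted, and the dimensions are assumptions:

```python
# Schematic of the global/local encoding split: BERT for a global
# sentence-pair view, a CNN over its token states for local keyword and
# phrase features. Relation-of-relation task omitted; dims assumed.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
local_cnn = nn.Conv1d(768, 128, kernel_size=3, padding=1)
classifier = nn.Linear(768 + 128, 2)

inputs = tokenizer("A dog runs.", "A dog is running.", return_tensors="pt")
with torch.no_grad():
    states = bert(**inputs).last_hidden_state    # (1, seq, 768)
global_vec = states[:, 0]                        # [CLS]: global view
local_vec = local_cnn(states.transpose(1, 2)).relu().max(dim=2).values
logits = classifier(torch.cat([global_vec, local_vec], dim=1))
print(logits)
```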
arXiv Detail & Related papers (2020-12-16T13:11:30Z)
- Delexicalized Paraphrase Generation [7.504832901086077]
We present a neural model for paraphrasing and train it to generate delexicalized sentences.
We achieve this by creating training data in which each input is paired with a number of reference paraphrases.
We show empirically that the generated paraphrases are of high quality, leading to an additional 1.29% exact match on live utterances.
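Delexicalization itself is easy to illustrate: slot values are replaced with typed placeholders so the model learns value-independent templates. The slot names and the example below are assumptions:

```python
# Illustration of delexicalization: slot values become typed placeholders
# so a paraphrase model learns templates rather than specific values.
import re

def delexicalize(utterance, slots):
    """Replace each slot value with a [SLOT_NAME] placeholder."""
    for name, value in slots.items():
        utterance = re.sub(re.escape(value), f"[{name.upper()}]",
                           utterance, flags=re.IGNORECASE)
    return utterance

utt = "Book a table at Luigi's for 7pm"
slots = {"restaurant": "Luigi's", "time": "7pm"}
print(delexicalize(utt, slots))
# -> "Book a table at [RESTAURANT] for [TIME]"
# A model trained on such templates can generate paraphrases like
# "Reserve [RESTAURANT] at [TIME]" and be re-lexicalized afterwards.
```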
arXiv Detail & Related papers (2020-12-04T18:28:30Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
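The distillation objective can be sketched as a KL term matching the student's word distribution at each position to the teacher syntactic LM's; the tensors below are random stand-ins for real model outputs:

```python
# Sketch of the distillation objective: train the student masked LM to
# match the teacher syntactic LM's distribution over words in context.
# Tensors are random stand-ins for real model outputs.
import torch
import torch.nn.functional as F

vocab = 100
teacher_logits = torch.randn(4, vocab)   # syntactic LM's predictions
student_logits = torch.randn(4, vocab, requires_grad=True)

# KL(teacher || student) over the word distribution at each position.
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
loss.backward()
print(float(loss))
```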
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
- Automatic Discovery of Novel Intents & Domains from Text Utterances [18.39942131996558]
We propose a novel framework, ADVIN, to automatically discover novel domains and intents from large volumes of unlabeled data.
ADVIN significantly outperforms baselines on three benchmark datasets, and real user utterances from a commercial voice-powered agent.
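ADVIN's exact pipeline is not described here; as a hedged sketch, one common recipe for novel-intent discovery embeds unlabeled utterances, flags those far from all known intents, and clusters the outliers into candidate new intents:

```python
# Hedged sketch of a generic novel-intent discovery recipe (not
# necessarily ADVIN's pipeline): flag utterances far from known intents,
# then cluster them into candidate new intents. Thresholds are assumed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
known_centroids = rng.normal(size=(3, 16))  # embeddings of known intents
utterances = rng.normal(size=(50, 16))      # unlabeled utterance embeddings

# Distance to the nearest known intent; far-away points are "novel".
dists = np.linalg.norm(
    utterances[:, None, :] - known_centroids[None, :, :], axis=-1
).min(axis=1)
novel = utterances[dists > np.percentile(dists, 80)]

# Cluster the novel utterances into candidate new intents.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(novel)
print(np.bincount(clusters))  # size of each candidate intent
```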
arXiv Detail & Related papers (2020-05-22T00:47:10Z)
- Interpretability Analysis for Named Entity Recognition to Understand
System Predictions and How They Can Improve [49.878051587667244]
We examine the performance of several variants of LSTM-CRF architectures for named entity recognition.
We find that context representations do contribute to system performance, but that the main factor driving high performance is learning the name tokens themselves.
We enlist human annotators to evaluate the feasibility of inferring entity types from context alone and find that, for the majority of the errors made by the context-only system, people are also unable to infer the entity type, though there is some room for improvement.
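The context-only condition can be approximated by masking the name tokens and asking a masked LM to fill the slot from context alone; the model and the example are illustrative, not the paper's setup:

```python
# Sketch of a "context-only" probe: mask the entity tokens and ask a
# masked LM to characterize the slot from context alone. Model and
# example are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

text = f"{tokenizer.mask_token} was appointed CEO of the company in 2019."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top))  # context-only guesses
```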
arXiv Detail & Related papers (2020-04-09T14:37:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.