Corpus-Based Paraphrase Detection Experiments and Review
- URL: http://arxiv.org/abs/2106.00145v1
- Date: Mon, 31 May 2021 23:29:24 GMT
- Title: Corpus-Based Paraphrase Detection Experiments and Review
- Authors: Tedo Vrbanec and Ana Mestrovic
- Abstract summary: Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, etc.
In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Paraphrase detection is important for a number of applications, including
plagiarism detection, authorship attribution, question answering, text
summarization, text mining in general, etc. In this paper, we give a
performance overview of various types of corpus-based models, especially deep
learning (DL) models, on the task of paraphrase detection. We report the
results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMo,
and USE) evaluated on three publicly available corpora: the Microsoft
Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis
Crowd Paraphrase Corpus 2011. Through a large number of experiments, we
determined the most appropriate choices of text pre-processing, hyper-parameters,
sub-model selection where one exists (e.g., Skip-gram vs. CBOW), distance
measure, and semantic similarity/paraphrase detection threshold. Our findings and those of
other researchers who have used deep learning models show that DL models are
very competitive with traditional state-of-the-art approaches and have
potential that should be further developed.
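The pipeline the abstract describes (vectorize two texts, compute a similarity measure, and compare it against a tuned threshold) can be sketched with the simplest of the eight models, TF-IDF with cosine similarity. The toy corpus, whitespace tokenization, and the 0.5 threshold below are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                       # document frequency per term
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def is_paraphrase(sim, threshold=0.5):
    """Binary paraphrase decision at a tuned similarity threshold (0.5 is arbitrary here)."""
    return sim >= threshold

# Illustrative mini-corpus: a paraphrase pair plus one unrelated sentence.
docs = [
    "the cat sat on the mat".split(),
    "a cat was sitting on the mat".split(),
    "stock prices fell sharply today".split(),
]
vecs = tf_idf_vectors(docs)
sim_para = cosine(vecs[0], vecs[1])    # paraphrase pair: shared vocabulary
sim_unrel = cosine(vecs[0], vecs[2])   # unrelated pair: no shared tokens
```

The embedding models in the paper (Word2Vec, GloVe, FastText, ELMo, USE) would replace the TF-IDF step with dense vectors, but the distance-measure-plus-threshold decision stage stays the same.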
Related papers
- Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD).
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
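The "ranking similarity for paraphrased queries" this entry measures can be illustrated with a toy dual-encoder setup: two nearby query embeddings each rank a small gallery by cosine similarity, and the two rankings are compared. All vectors below are made up for illustration; they are not CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ranking(query, gallery):
    """Indices of gallery items sorted by descending similarity to the query."""
    return sorted(range(len(gallery)), key=lambda i: -cosine(query, gallery[i]))

# Hypothetical embeddings: two paraphrased queries and a three-item gallery.
q1 = [0.9, 0.1, 0.0]
q2 = [0.8, 0.2, 0.1]   # paraphrase of q1: nearby in embedding space
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]

r1, r2 = ranking(q1, gallery), ranking(q2, gallery)
# Exact-position agreement between the two rankings (1.0 = identical order).
agreement = sum(a == b for a, b in zip(r1, r2)) / len(r1)
```

A retrieval model well adapted to paraphrases should keep this agreement high; the cited paper uses proper rank-correlation metrics rather than this simple position overlap.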
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scaling of Texts with Large Language Models [3.9940425551415597]
Existing text scaling methods often require a large corpus, struggle with short texts, or require labeled data.
We develop a text scaling method that leverages the pattern recognition capabilities of generative large language models.
We demonstrate how combining substantive knowledge with LLMs can create state-of-the-art measures of abstract concepts.
arXiv Detail & Related papers (2023-10-18T15:34:37Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
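The word-intrusion task mentioned here asks whether the word least related to a topic's top words can be spotted. A minimal sketch, with hand-made toy vectors standing in for real word embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def find_intruder(words, vecs):
    """Return the word with the lowest average similarity to the others."""
    def avg_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vecs[w], vecs[o]) for o in others) / len(others)
    return min(words, key=avg_sim)

# Toy 2-d vectors: three clustered "topic" words plus one unrelated intruder.
vecs = {
    "atom":     [0.90, 0.10],
    "electron": [0.80, 0.20],
    "proton":   [0.85, 0.15],
    "banana":   [0.10, 0.90],   # the intruder
}
intruder = find_intruder(list(vecs), vecs)
```

A coherent topic makes the intruder easy to find; the paper's evaluation correlates such automatic detections with human judgments.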
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models [4.874780144224057]
We use the capacity for memorization and extrapolation of Transformer Language Models to learn the linguistic behaviors of a subgroup within larger corpora of Yelp reviews.
We show that even when a specific keyphrase is rare or absent in the training corpora, GPT is able to generate large volumes of text with the correct sentiment.
arXiv Detail & Related papers (2022-04-15T14:33:58Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work so that they can learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- A Multi-cascaded Model with Data Augmentation for Enhanced Paraphrase Detection in Short Texts [1.6758573326215689]
We present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts.
Our model is both wide and deep and provides greater robustness across clean and noisy short texts.
arXiv Detail & Related papers (2019-12-27T12:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.