Corpus-Based Paraphrase Detection Experiments and Review
- URL: http://arxiv.org/abs/2106.00145v1
- Date: Mon, 31 May 2021 23:29:24 GMT
- Title: Corpus-Based Paraphrase Detection Experiments and Review
- Authors: Tedo Vrbanec and Ana Mestrovic
- Abstract summary: Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, etc.
In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Paraphrase detection is important for a number of applications, including
plagiarism detection, authorship attribution, question answering, text
summarization, text mining in general, etc. In this paper, we give a
performance overview of various types of corpus-based models, especially deep
learning (DL) models, on the task of paraphrase detection. We report the
results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMo,
and USE) evaluated on three publicly available corpora: the Microsoft
Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis
Crowd Paraphrase Corpus 2011. Through a large number of experiments, we
determined the most appropriate choices of text pre-processing, hyper-parameters,
sub-model selection where one exists (e.g., Skip-gram vs. CBOW), distance
measure, and semantic similarity/paraphrase detection threshold. Our findings and those of
other researchers who have used deep learning models show that DL models are
very competitive with traditional state-of-the-art approaches and have
potential that should be further developed.
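The pipeline the abstract describes (vectorize two texts, compute a similarity measure, and compare it against a tuned threshold) can be sketched with the simplest of the eight models, TF-IDF with cosine similarity. The toy corpus, whitespace tokenization, and the 0.5 threshold below are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                       # document frequency per term
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def is_paraphrase(sim, threshold=0.5):
    """Binary paraphrase decision at a tuned similarity threshold (0.5 is arbitrary here)."""
    return sim >= threshold

# Illustrative mini-corpus: a paraphrase pair plus one unrelated sentence.
docs = [
    "the cat sat on the mat".split(),
    "a cat was sitting on the mat".split(),
    "stock prices fell sharply today".split(),
]
vecs = tf_idf_vectors(docs)
sim_para = cosine(vecs[0], vecs[1])    # paraphrase pair: shared vocabulary
sim_unrel = cosine(vecs[0], vecs[2])   # unrelated pair: no shared tokens
```

The embedding models in the paper (Word2Vec, GloVe, FastText, ELMo, USE) would replace the TF-IDF step with dense vectors, but the distance-measure-plus-threshold decision stage stays the same.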
Related papers
- Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD).
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
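The "ranking similarity for paraphrased queries" this entry measures can be illustrated with a toy dual-encoder setup: two nearby query embeddings each rank a small gallery by cosine similarity, and the two rankings are compared. All vectors below are made up for illustration; they are not CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ranking(query, gallery):
    """Indices of gallery items sorted by descending similarity to the query."""
    return sorted(range(len(gallery)), key=lambda i: -cosine(query, gallery[i]))

# Hypothetical embeddings: two paraphrased queries and a three-item gallery.
q1 = [0.9, 0.1, 0.0]
q2 = [0.8, 0.2, 0.1]   # paraphrase of q1: nearby in embedding space
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]

r1, r2 = ranking(q1, gallery), ranking(q2, gallery)
# Exact-position agreement between the two rankings (1.0 = identical order).
agreement = sum(a == b for a, b in zip(r1, r2)) / len(r1)
```

A retrieval model well adapted to paraphrases should keep this agreement high; the cited paper uses proper rank-correlation metrics rather than this simple position overlap.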
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection [9.788417605537965]
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scaling of Texts with Large Language Models [3.9940425551415597]
Existing text scaling methods often require a large corpus, struggle with short texts, or require labeled data.
We develop a text scaling method that leverages the pattern recognition capabilities of generative large language models.
We demonstrate how combining substantive knowledge with LLMs can create state-of-the-art measures of abstract concepts.
arXiv Detail & Related papers (2023-10-18T15:34:37Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Topics in the Haystack: Extracting and Evaluating Topics beyond Coherence [0.0]
We propose a method that incorporates a deeper understanding of both sentence and document themes.
This allows our model to detect latent topics that may include uncommon words or neologisms.
We present correlation coefficients with human identification of intruder words and achieve near-human level results at the word-intrusion task.
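The word-intrusion task mentioned here asks whether the word least related to a topic's top words can be spotted. A minimal sketch, with hand-made toy vectors standing in for real word embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def find_intruder(words, vecs):
    """Return the word with the lowest average similarity to the others."""
    def avg_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(vecs[w], vecs[o]) for o in others) / len(others)
    return min(words, key=avg_sim)

# Toy 2-d vectors: three clustered "topic" words plus one unrelated intruder.
vecs = {
    "atom":     [0.90, 0.10],
    "electron": [0.80, 0.20],
    "proton":   [0.85, 0.15],
    "banana":   [0.10, 0.90],   # the intruder
}
intruder = find_intruder(list(vecs), vecs)
```

A coherent topic makes the intruder easy to find; the paper's evaluation correlates such automatic detections with human judgments.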
arXiv Detail & Related papers (2023-03-30T12:24:25Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models [4.874780144224057]
We use the capacity for memorization and extrapolation of Transformer Language Models to learn the linguistic behaviors of a subgroup within larger corpora of Yelp reviews.
We show that even when a specific keyphrase is rare or absent in the training corpora, GPT is able to generate large volumes of text with the correct sentiment.
arXiv Detail & Related papers (2022-04-15T14:33:58Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work so that they can learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- A Multi-cascaded Model with Data Augmentation for Enhanced Paraphrase Detection in Short Texts [1.6758573326215689]
We present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts.
Our model is both wide and deep and provides greater robustness across clean and noisy short texts.
arXiv Detail & Related papers (2019-12-27T12:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.