Semantic Search as Extractive Paraphrase Span Detection
- URL: http://arxiv.org/abs/2112.04886v1
- Date: Thu, 9 Dec 2021 13:16:42 GMT
- Title: Semantic Search as Extractive Paraphrase Span Detection
- Authors: Jenna Kanerva, Hanna Kitti, Li-Hsin Chang, Teemu Vahtola, Mathias
Creutz and Filip Ginter
- Abstract summary: We frame the problem of semantic search by framing the search task as paraphrase span detection.
On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, we find that our paraphrase span detection model outperforms two strong retrieval baselines.
We introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources are not available.
- Score: 0.8137055256093007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we approach the problem of semantic search by framing the
search task as paraphrase span detection, i.e. given a segment of text as a
query phrase, the task is to identify its paraphrase in a given document, the
same modelling setup as typically used in extractive question answering. On the
Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs
including their original document context, we find that our paraphrase span
detection model outperforms two strong retrieval baselines (lexical similarity
and BERT sentence embeddings) by 31.9pp and 22.4pp respectively in terms of
exact match, and by 22.3pp and 12.9pp in terms of token-level F-score. This
demonstrates a strong advantage of modelling the task in terms of span
retrieval, rather than sentence similarity. Additionally, we introduce a method
for creating artificial paraphrase data through back-translation, suitable for
languages where manually annotated paraphrase resources for training the span
detection model are not available.
Related papers
- Spotting AI's Touch: Identifying LLM-Paraphrased Spans in Text [61.22649031769564]
We propose a novel framework, paraphrased text span detection (PTD)
PTD aims to identify paraphrased text spans within a text.
We construct a dedicated dataset, PASTED, for paraphrased text span detection.
arXiv Detail & Related papers (2024-05-21T11:22:27Z) - Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
Often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z) - SimCKP: Simple Contrastive Learning of Keyphrase Representations [36.88517357720033]
We propose SimCKP, a simple contrastive learning framework that consists of two stages: 1) An extractor-generator that extracts keyphrases by learning context-aware phrase-level representations in a contrastive manner while also generating keyphrases that do not appear in the document; and 2) A reranker that adapts scores for each generated phrase by likewise aligning their representations with the corresponding document.
arXiv Detail & Related papers (2023-10-12T11:11:54Z) - Description-Based Text Similarity [59.552704474862004]
We identify the need to search for texts based on abstract descriptions of their content.
We propose an alternative model that significantly improves when used in standard nearest neighbor search.
arXiv Detail & Related papers (2023-05-21T17:14:31Z) - Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical matching based approach achieves a similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
arXiv Detail & Related papers (2022-10-13T15:18:04Z) - Applying Transformer-based Text Summarization for Keyphrase Generation [2.28438857884398]
Keyphrases are crucial for searching and systematizing scholarly documents.
In this paper, we experiment with popular transformer-based models for abstractive text summarization.
We show that summarization models are quite effective in generating keyphrases in the terms of the full-match F1-score and BERT.Score.
We also investigate several ordering strategies to target keyphrases.
arXiv Detail & Related papers (2022-09-08T13:01:52Z) - Improving Paraphrase Detection with the Adversarial Paraphrasing Task [0.0]
Paraphrasing datasets currently rely on a sense of paraphrase based on word overlap and syntax.
We introduce a new adversarial method of dataset creation for paraphrase identification: the Adversarial Paraphrasing Task (APT)
APT asks participants to generate semantically equivalent (in the sense of mutually implicative) but lexically and syntactically disparate paraphrases.
arXiv Detail & Related papers (2021-06-14T18:15:20Z) - Keyphrase Extraction with Span-based Feature Representations [13.790461555410747]
Keyphrases are capable of providing semantic metadata characterizing documents.
Three approaches to address keyphrase extraction: (i) traditional two-step ranking method, (ii) sequence labeling and (iii) generation using neural networks.
In this paper, we propose a novelty Span Keyphrase Extraction model that extracts span-based feature representation of keyphrase directly from all the content tokens.
arXiv Detail & Related papers (2020-02-13T09:48:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.