Towards Effective Paraphrasing for Information Disguise
- URL: http://arxiv.org/abs/2311.05018v1
- Date: Wed, 8 Nov 2023 21:12:59 GMT
- Title: Towards Effective Paraphrasing for Information Disguise
- Authors: Anmol Agarwal, Shrey Gupta, Vamshi Bonagiri, Manas Gaur, Joseph Reagle, Ponnurangam Kumaraguru
- Abstract summary: Research on Information Disguise (ID) becomes important when authors' written online communication pertains to sensitive domains.
We propose a framework where, for a given sentence from an author's post, we perform iterative perturbation on the sentence in the direction of paraphrasing.
Our work introduces a novel method of phrase-importance rankings using perplexity scores and involves multi-level phrase substitutions via beam search.
- Score: 13.356934367660811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information Disguise (ID), a part of computational ethics in Natural Language
Processing (NLP), is concerned with best practices of textual paraphrasing to
prevent the non-consensual use of authors' posts on the Internet. Research on
ID becomes important when authors' written online communication pertains to
sensitive domains, e.g., mental health. Over time, researchers have utilized
AI-based automated word spinners (e.g., SpinRewriter, WordAI) for paraphrasing
content. However, these tools fail to satisfy the purpose of ID as their
paraphrased content still leads to the source when queried on search engines.
There is limited prior work on judging the effectiveness of paraphrasing
methods for ID on search engines or their proxies, neural retriever (NeurIR)
models. We propose a framework where, for a given sentence from an author's
post, we perform iterative perturbation on the sentence in the direction of
paraphrasing with an attempt to confuse the search mechanism of a NeurIR system
when the sentence is queried on it. Our experiments involve the subreddit
'r/AmItheAsshole' as the source of public content and Dense Passage Retriever
as a NeurIR system-based proxy for search engines. Our work introduces a novel
method of phrase-importance rankings using perplexity scores and involves
multi-level phrase substitutions via beam search. Our multi-phrase substitution
scheme succeeds in disguising sentences 82% of the time and hence takes an
essential step towards enabling researchers to disguise sensitive content
effectively before making it public. We also release the code of our approach.
Related papers
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- RADAR: Robust AI-Text Detection via Adversarial Learning [69.5883095262619]
RADAR is based on adversarial training of a paraphraser and a detector.
The paraphraser's goal is to generate realistic content to evade AI-text detection.
RADAR uses the feedback from the detector to update the paraphraser, and vice versa.
arXiv Detail & Related papers (2023-07-07T21:13:27Z)
- Beyond Black Box AI-Generated Plagiarism Detection: From Sentence to Document Level [4.250876580245865]
Existing AI-generated text classifiers have limited accuracy and often produce false positives.
We propose a novel approach using natural language processing (NLP) techniques.
We generate multiple paraphrased versions of a given question and input them into the large language model to generate answers.
By using a contrastive loss function based on cosine similarity, we match generated sentences with those from the student's response.
arXiv Detail & Related papers (2023-06-13T20:34:55Z)
- Integrity and Junkiness Failure Handling for Embedding-based Retrieval: A Case Study in Social Network Search [26.705196461992845]
Embedding-based retrieval is used in a variety of search applications, such as e-commerce and social network search.
In this paper, we conduct an analysis of embedding-based retrieval launched in early 2021 on our social network search engine.
We define two main categories of failures introduced by it, integrity and junkiness.
arXiv Detail & Related papers (2023-04-18T20:53:47Z)
- Paraphrase Identification with Deep Learning: A Review of Datasets and Methods [1.4325734372991794]
We investigate how the under-representation of certain paraphrase types in popular datasets affects the ability to detect plagiarism.
We introduce and validate a new refined typology for paraphrases.
We propose new directions for future research and dataset development to enhance AI-based paraphrase detection.
arXiv Detail & Related papers (2022-12-13T23:06:20Z)
- An Insight into The Intricacies of Lingual Paraphrasing Pragmatic Discourse on The Purpose of Synonyms [0.0]
We develop an algorithm to paraphrase any text document or paragraph using WordNet and the Natural Language Toolkit (NLTK).
For 250 paragraphs, our algorithm achieved a paraphrase accuracy of 94.8%.
arXiv Detail & Related papers (2022-06-07T02:57:27Z)
- Phrase Retrieval Learns Passage Retrieval, Too [77.57208968326422]
We study whether phrase retrieval can serve as the basis for coarse-level retrieval including passages and documents.
We show that a dense phrase-retrieval system, without any retraining, already achieves better passage retrieval accuracy than passage retrievers.
We also show that phrase filtering and vector quantization can reduce the size of our index by 4-10x.
arXiv Detail & Related papers (2021-09-16T17:42:45Z)
- LadRa-Net: Locally-Aware Dynamic Re-read Attention Net for Sentence Semantic Matching [66.65398852962177]
We develop a novel Dynamic Re-read Network (DRr-Net) for sentence semantic matching.
We extend DRr-Net to the Locally-Aware Dynamic Re-read Attention Net (LadRa-Net).
Experiments on two popular sentence semantic matching tasks demonstrate that DRr-Net can significantly improve the performance of sentence semantic matching.
arXiv Detail & Related papers (2021-08-06T02:07:04Z)
- Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals [69.76097138157816]
Probabilistic text generators have been used to produce fake scientific papers for more than a decade.
Complex AI-powered generation techniques produce texts indistinguishable from that of humans.
Some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases.
arXiv Detail & Related papers (2021-07-12T20:47:08Z)
- A Replication Study of Dense Passage Retriever [32.192420072129636]
We study the dense passage retriever (DPR) technique proposed by Karpukhin et al. (2020) for end-to-end open-domain question answering.
We present a replication study of this work, starting with model checkpoints provided by the authors.
We are able to improve end-to-end question answering effectiveness using exactly the same models as in the original work.
arXiv Detail & Related papers (2021-04-12T18:10:39Z)
- CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search [89.48123965553098]
This paper presents a search system to alleviate the special domain adaption problem.
The system uses domain-adaptive pretraining and few-shot learning to help neural rankers mitigate the domain discrepancy.
Our system performs the best among the non-manual runs in Round 2 of the TREC-COVID task.
arXiv Detail & Related papers (2020-11-03T09:10:48Z)