A Multi-input Multi-output Transformer-based Hybrid Neural Network for
Multi-class Privacy Disclosure Detection
- URL: http://arxiv.org/abs/2108.08483v2
- Date: Fri, 20 Aug 2021 18:09:22 GMT
- Title: A Multi-input Multi-output Transformer-based Hybrid Neural Network for
Multi-class Privacy Disclosure Detection
- Authors: A K M Nuhil Mehdy, Hoda Mehrpouyan
- Abstract summary: In this paper, we propose a multi-input, multi-output hybrid neural network which utilizes transfer-learning, linguistics, and metadata to learn the hidden patterns.
We trained and evaluated our model on a human-annotated ground truth dataset, containing a total of 5,400 tweets.
- Score: 3.04585143845864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The concern regarding users' data privacy has risen to its highest level due
to the massive increase in communication platforms, social networking sites,
and greater user participation in online public discourse. An increasing
number of people exchange private information via emails, text messages, and
social media without being aware of the risks and implications. Researchers in
the field of Natural Language Processing (NLP) have concentrated on creating
tools and strategies to identify, categorize, and sanitize private information
in text data since a substantial amount of data is exchanged in textual form.
However, most of the detection methods solely rely on the existence of
pre-identified keywords in the text and disregard the inference of the
underlying meaning of the utterance in a specific context. Hence, in some
situations, these tools and algorithms fail to detect disclosure, or the
produced results are misclassified. In this paper, we propose a multi-input,
multi-output hybrid neural network which utilizes transfer learning,
linguistics, and metadata to learn the hidden patterns. Our goal is to better
classify disclosure/non-disclosure content in terms of the context of
situation. We trained and evaluated our model on a human-annotated ground truth
dataset, containing a total of 5,400 tweets. The results show that the proposed
model was able to identify privacy disclosure in tweets with an accuracy
of 77.4% while classifying the information type of those tweets with an
impressive accuracy of 99%, by jointly learning the two separate tasks.
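The multi-input, multi-output idea described in the abstract can be illustrated with a minimal, pure-Python sketch. This is not the authors' architecture: the layer sizes, the fusion by concatenation, the random weights, the loss weight `alpha`, and all names (`forward`, `joint_loss`, `text_vec`, `meta_vec`) are assumptions made for illustration. Here `text_vec` stands in for an embedding that a pre-trained transformer would produce, and `meta_vec` for tweet metadata features.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def init(rows, cols, rng):
    """Small random weight matrix (illustrative, untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def forward(text_vec, meta_vec, params):
    # Two inputs are fused by concatenation (an assumption), then share a layer.
    fused = text_vec + meta_vec
    hidden = relu(dense(fused, params["W_shared"], params["b_shared"]))
    # Output head 1: disclosure vs. non-disclosure (binary).
    p_disclosure = sigmoid(dense(hidden, params["W_disc"], params["b_disc"])[0])
    # Output head 2: information type (multi-class).
    p_type = softmax(dense(hidden, params["W_type"], params["b_type"]))
    return p_disclosure, p_type

def joint_loss(p_disclosure, p_type, y_disclosure, y_type, alpha=0.5):
    """Weighted sum of the two task losses; alpha is a hypothetical weight."""
    eps = 1e-12
    bce = -(y_disclosure * math.log(p_disclosure + eps)
            + (1 - y_disclosure) * math.log(1 - p_disclosure + eps))
    ce = -math.log(p_type[y_type] + eps)
    return alpha * bce + (1 - alpha) * ce

# Hypothetical dimensions, chosen only to keep the example small.
TEXT_DIM, META_DIM, HIDDEN, N_TYPES = 4, 2, 3, 3
rng = random.Random(0)
params = {
    "W_shared": init(HIDDEN, TEXT_DIM + META_DIM, rng), "b_shared": [0.0] * HIDDEN,
    "W_disc": init(1, HIDDEN, rng), "b_disc": [0.0],
    "W_type": init(N_TYPES, HIDDEN, rng), "b_type": [0.0] * N_TYPES,
}

text_vec = [0.2, -0.1, 0.7, 0.05]   # placeholder for a transformer embedding
meta_vec = [1.0, 0.0]               # placeholder for metadata features
p_disclosure, p_type = forward(text_vec, meta_vec, params)
loss = joint_loss(p_disclosure, p_type, y_disclosure=1, y_type=2)
print(p_disclosure, p_type, loss)
```

Because both heads read the same hidden representation, minimizing the combined loss would update the shared layer using signals from both tasks, which is the general sense in which two tasks are learned jointly.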
Related papers
- NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human [55.20137833039499]
We suggest sanitizing sensitive text using two common strategies used by humans.
We curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models.
arXiv Detail & Related papers (2024-06-06T05:07:44Z)
- When Graph Convolution Meets Double Attention: Online Privacy Disclosure Detection with Multi-Label Text Classification [6.700420953065072]
It is important to detect such unwanted privacy disclosures to help alert people affected and the online platform.
In this paper, privacy disclosure detection is modeled as a multi-label text classification problem.
A new privacy disclosure detection model is proposed to construct an MLTC classifier for detecting online privacy disclosures.
arXiv Detail & Related papers (2023-11-27T15:25:17Z)
- ManiTweet: A New Benchmark for Identifying Manipulation of News on Social Media [74.93847489218008]
We present a novel task, identifying manipulation of news on social media, which aims to detect manipulation in social media posts and identify manipulated or inserted information.
To study this task, we have proposed a data collection schema and curated a dataset called ManiTweet, consisting of 3.6K pairs of tweets and corresponding articles.
Our analysis demonstrates that this task is highly challenging, with large language models (LLMs) yielding unsatisfactory performance.
arXiv Detail & Related papers (2023-05-23T16:40:07Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows the superior performance of non-matched image-text pair detection when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Hate Speech and Offensive Language Detection using an Emotion-aware Shared Encoder [1.8734449181723825]
Existing works on hate speech and offensive language detection produce promising results based on pre-trained transformer models.
This paper presents a multi-task joint learning approach which combines external emotional features extracted from other corpora.
Our findings demonstrate that emotional knowledge helps to more reliably identify hate speech and offensive language across datasets.
arXiv Detail & Related papers (2023-02-17T09:31:06Z)
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes significantly to countering malicious information by developing multilingual tools to simulate and detect new methods of evasion of content.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- Differentially Private Language Models for Secure Data Sharing [19.918137395199224]
In this paper, we show how to train a generative language model in a differentially private manner and subsequently sample data from it.
Using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets.
We perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality.
arXiv Detail & Related papers (2022-10-25T11:12:56Z)
- Panning for gold: Lessons learned from the platform-agnostic automated detection of political content in textual data [48.7576911714538]
We discuss how these techniques can be used to detect political content across different platforms.
We compare the performance of three groups of detection techniques relying on dictionaries, supervised machine learning, or neural networks.
Our results show the limited impact of preprocessing on model performance, with the best results for less noisy data being achieved by neural network- and machine-learning-based models.
arXiv Detail & Related papers (2022-07-01T15:23:23Z)
- You Are What You Write: Preserving Privacy in the Era of Large Language Models [2.3431670397288005]
We present an empirical investigation into the extent of the personal information encoded into pre-trained representations by a range of popular models.
We show a positive correlation between the complexity of a model, the amount of data used in pre-training, and data leakage.
arXiv Detail & Related papers (2022-04-20T11:12:53Z)
- Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.