Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches
- URL: http://arxiv.org/abs/2404.14465v1
- Date: Mon, 22 Apr 2024 12:06:54 GMT
- Title: Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches
- Authors: Dimitris Asimopoulos, Ilias Siniosoglou, Vasileios Argyriou, Thomai Karamitsou, Eleftherios Fountoukidis, Sotirios K. Goudos, Ioannis D. Moscholios, Konstantinos E. Psannis, Panagiotis Sarigiannidis
- Abstract summary: This paper compares the performance of transformer-based models and Large Language Models against traditional architectures for text anonymisation.
Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods.
- Score: 5.891554349884001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of data privacy, the ability to effectively anonymise text is paramount. With the proliferation of deep learning and, in particular, transformer architectures, there is a burgeoning interest in leveraging these advanced models for text anonymisation tasks. This paper presents a comprehensive benchmarking study comparing the performance of transformer-based models and Large Language Models (LLMs) against traditional architectures for text anonymisation. Utilising the CoNLL-2003 dataset, known for its robustness and diversity, we evaluate several models. Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods. Notably, while modern models exhibit advanced capabilities in capturing contextual nuances, certain traditional architectures still maintain high performance. This work aims to guide researchers in selecting the most suitable model for their anonymisation needs, while also shedding light on potential paths for future advancements in the field.
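As a rough illustration of the transformer side of such a benchmark, the following is a minimal sketch (not the authors' released code) of NER-based anonymisation with a public checkpoint fine-tuned on CoNLL-2003; the checkpoint name and the placeholder scheme are assumptions for illustration.

```python
# Minimal sketch of transformer-based text anonymisation (not the paper's code).
# Assumes the Hugging Face `transformers` library and a public NER model
# fine-tuned on CoNLL-2003; the placeholder scheme is illustrative.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",  # assumed checkpoint
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

def anonymise(text: str) -> str:
    """Replace detected PER/LOC/ORG/MISC spans with type placeholders."""
    entities = sorted(ner(text), key=lambda e: e["start"], reverse=True)
    for ent in entities:  # replace right-to-left so character offsets stay valid
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"] :]
    return text

print(anonymise("Angela Merkel met U.N. officials in Berlin."))
# e.g. "[PER] met [ORG] officials in [LOC]."
```

Entity-level precision, recall, and F1 against the CoNLL-2003 gold annotations are the usual way such a system would then be scored.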
Related papers
- Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs [25.915607750636333]
We propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling.
Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity.
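The expand-then-model recipe can be sketched in a few lines; here `expand_with_llm` is a hypothetical stand-in for any LLM call, and plain sklearn LDA replaces the paper's prefix-tuned VAE topic model.

```python
# Sketch of LLM-driven context expansion before topic modeling.
# `expand_with_llm` is a hypothetical stub for an LLM call; plain sklearn
# LDA stands in for the paper's prefix-tuned VAE topic model.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def expand_with_llm(short_text: str) -> str:
    # Hypothetical: prompt an LLM to elaborate the short text into a
    # richer passage before topic modeling. Wire up a real client here.
    return short_text

short_docs = ["gpu prices rising", "vaccine trial results", "league final scores"]
expanded = [expand_with_llm(d) for d in short_docs]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(expanded)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
```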
arXiv Detail & Related papers (2024-10-04T01:28:56Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
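A hedged sketch of how such an evaluate-and-optimise loop could be wired together; `propose_rewrite` and `llm_score` are hypothetical placeholders, not the paper's actual components or prompts.

```python
# Skeleton of a privacy/utility evaluate-and-optimise anonymisation loop.
# `propose_rewrite` and `llm_score` are hypothetical placeholders for LLM
# calls; the cited paper's actual components are not reproduced here.
def propose_rewrite(text: str) -> str:
    # Hypothetical: ask an LLM for a more strongly anonymised rewrite.
    return text  # placeholder; connect a real LLM client here

def llm_score(criterion: str, text: str) -> float:
    # Hypothetical: ask an LLM to rate `text` on [0, 1] for `criterion`.
    return 0.5  # placeholder score

def anonymise_with_feedback(text: str, rounds: int = 3) -> str:
    best, best_score = text, float("-inf")
    for _ in range(rounds):
        candidate = propose_rewrite(best)
        privacy = llm_score("resistance to re-identification", candidate)
        utility = llm_score("preservation of task-relevant content", candidate)
        score = privacy + utility  # naive sum; the paper optimises this trade-off
        if score > best_score:
            best, best_score = candidate, score
    return best
```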
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- Evaluating the Efficacy of AI Techniques in Textual Anonymization: A Comparative Study [5.962542204378336]
This research examines text anonymisation methods, focusing on Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Embeddings from Language Models (ELMo), and Transformer architectures.
Preliminary results indicate that CRF, LSTM, and ELMo individually outperform traditional methods.
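For the CRF baseline, a minimal sequence-labelling sketch in the sklearn-crfsuite style; the feature template and toy data are illustrative, not the study's exact configuration.

```python
# Minimal CRF sequence-labelling sketch (assumes the sklearn-crfsuite package);
# feature templates and toy data are illustrative, not the study's setup.
import sklearn_crfsuite

def word_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "is_upper": w.isupper(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy CoNLL-style training data: token sequences with BIO entity tags.
X_train = [[word_features(["John", "lives", "in", "Paris"], i) for i in range(4)]]
y_train = [["B-PER", "O", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```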
arXiv Detail & Related papers (2024-05-09T11:29:25Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z)
- Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z)
- Masked Modeling for Self-supervised Representation Learning on Vision and Beyond [69.64364187449773]
Masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training.
We elaborate on the details of techniques within masked modeling, including diverse masking strategies, recovering targets, network architectures, and more.
We conclude by discussing the limitations of current techniques and point out several potential avenues for advancing masked modeling research.
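The core objective is compact enough to sketch: hide a proportion of the input tokens and ask the model to reconstruct them. The mask id, ratio, and -100 ignore-index below are illustrative conventions, not tied to any specific framework.

```python
# Toy sketch of the masked-modeling objective: hide a fixed proportion of
# tokens and train a model to reconstruct them. Ids and ratio are illustrative.
import random

MASK_ID = 0

def mask_tokens(token_ids, mask_ratio=0.15, seed=None):
    """Return (masked sequence, reconstruction targets); -100 = not a target."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_ratio:
            masked.append(MASK_ID)   # replace the token with the mask symbol
            targets.append(tok)      # the model must predict the original here
        else:
            masked.append(tok)
            targets.append(-100)     # ignored by the loss at unmasked positions
    return masked, targets

print(mask_tokens([5, 8, 13, 21, 34], mask_ratio=0.4, seed=0))
```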
arXiv Detail & Related papers (2023-12-31T12:03:21Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that large conditional generative models are hard to judge with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization [22.84767881115746]
Our work provides crucial insights into the gaps between original and anonymized data.
We make our code, pseudonymized datasets, and downstream models publicly available.
arXiv Detail & Related papers (2023-06-08T21:06:19Z)
- Fine-grained Text Style Transfer with Diffusion-Based Language Models [50.02698074338317]
We trained a diffusion-based model on the StylePTB dataset, the standard benchmark for fine-grained text style transfer.
Our model was able to achieve state-of-the-art performance on both individual and compositional transfers.
arXiv Detail & Related papers (2023-05-31T02:51:26Z)
- Multi-Modal Experience Inspired AI Creation [33.34566822058209]
We study how to generate texts based on sequential multi-modal information.
We first design a multi-channel sequence-to-sequence architecture equipped with a multi-modal attention network.
We then propose a curriculum negative sampling strategy tailored for the sequential inputs.
arXiv Detail & Related papers (2022-09-02T11:50:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.