Abstract: The protection of private information is a crucial issue in data-driven
research and business contexts. Typically, techniques like anonymisation or
(selective) deletion are applied to allow data sharing, \eg\ in the case of
collaborative research endeavours. Among anonymisation criteria,
$k$-anonymity is one of the most popular, with numerous scientific
publications on different algorithms and metrics.
Anonymisation techniques often require modifying the data and can thereby
affect the results of machine learning models trained on the underlying data.
In this work, we conduct a systematic comparison and detailed investigation
into the effects of different $k$-anonymisation algorithms on the results of
machine learning models. We investigate a set of popular $k$-anonymisation
algorithms with different classifiers and evaluate them on different real-world
datasets. Our systematic evaluation shows that with an increasingly strong
$k$-anonymity constraint, the classification performance generally degrades,
but to varying degrees and strongly depending on the dataset and anonymisation
method. Furthermore, Mondrian can be considered the method with the most
appealing properties for subsequent classification.