Large-scale online deanonymization with LLMs
- URL: http://arxiv.org/abs/2602.16800v2
- Date: Wed, 25 Feb 2026 18:37:33 GMT
- Title: Large-scale online deanonymization with LLMs
- Authors: Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, Florian Tramèr
- Abstract summary: We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision.
- Score: 58.46277616551135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to classical deanonymization work (e.g., on the Netflix prize) that required structured data, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
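The headline number in the abstract, recall at a fixed precision, can be read as follows: the attacker ranks proposed matches by confidence, abstains below some cutoff, and reports the largest fraction of true matches recovered while the emitted matches stay at or above the precision target. Below is a minimal sketch of that metric only. The function name, the toy data, and the assumption that every proposed match carries a single scalar confidence are illustrative and are not taken from the paper; ties in confidence are not grouped.

```python
from typing import List, Tuple


def recall_at_precision(
    preds: List[Tuple[float, bool]],  # (confidence, is_correct) per proposed match
    n_total: int,                     # number of true matches that exist
    target: float = 0.9,              # required precision among emitted matches
) -> float:
    """Best recall over all confidence cutoffs whose precision meets the target."""
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    best = 0.0
    correct = 0
    for emitted, (_, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        # Precision if we emit only the top `emitted` proposals.
        if correct / emitted >= target:
            best = max(best, correct / n_total)
    return best


# Hypothetical toy data: five proposed matches, four true matches in total.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
print(recall_at_precision(toy, n_total=4, target=0.75))
print(recall_at_precision(toy, n_total=4, target=0.9))
```

On the toy data, a stricter precision target forces an earlier cutoff and therefore a lower recall, which is the trade-off the reported figures summarize.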
Related papers
- De-Anonymization at Scale via Tournament-Style Attribution [15.47801233755864]
De-Anonymization at Scale (DAS) is a large language model-based method for attributing authorship among tens of thousands of candidate texts. DAS can recover same-author texts from pools of tens of thousands with accuracy well above chance, demonstrating a realistic privacy risk for anonymous platforms.
(2026-01-18T13:49:43Z)
- Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge [8.554894195710204]
We propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories.
(2025-08-12T09:23:35Z)
- PPMI: Privacy-Preserving LLM Interaction with Socratic Chain-of-Thought Reasoning and Homomorphically Encrypted Vector Databases [38.43532939618273]
Large language models (LLMs) are increasingly used as personal agents, accessing sensitive user data such as calendars, emails, and medical records. Users currently face a trade-off: They can send private records to powerful but untrusted LLM providers, increasing their exposure risk.
(2025-06-19T07:13:30Z)
- Automated Profile Inference with Language Model Agents [67.32226960040514]
We study a new threat that LLMs pose to online pseudonymity, called automated profile inference. An adversary can instruct LLMs to automatically scrape and extract sensitive personal attributes from publicly visible user activities on pseudonymous platforms. We introduce an automated profiling framework called AutoProfiler to assess the feasibility of such threats in real-world scenarios.
(2025-05-18T13:05:17Z)
- Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale [53.059480071818136]
Large Language Models (LLMs) have emerged as personalized assistants for users across a wide range of tasks. PERSONAMEM features curated user profiles with over 180 simulated user-LLM interaction histories. We evaluate LLM chatbots' ability to identify the most suitable response according to the current state of the user's profile.
(2025-04-19T08:16:10Z)
- Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering [8.20929362102942]
Author profiling is the task of inferring characteristics about individuals by analyzing content they share. We propose a new method for author profiling which aims at distinguishing relevant from irrelevant content first, followed by the actual user profiling only with relevant data. We evaluate our method for Big Five personality trait prediction on two Twitter corpora.
(2024-09-06T08:43:10Z)
- Evaluating LLM-based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) based personal information extraction can be benchmarked. LLMs can be misused by attackers to accurately extract various personal information from personal profiles. Prompt injection can defend against strong LLM-based attacks, reducing the attack to less effective traditional ones.
(2024-08-14T04:49:30Z)
- Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training [55.321010757641524]
A primary concern regarding training large language models (LLMs) is whether they abuse copyrighted online text. We propose an alternative insert-and-detect methodology, advocating that web users and content platforms employ unique identifiers for reliable and independent membership inference.
(2024-03-23T06:36:32Z)
- The Looming Threat of Fake and LLM-generated LinkedIn Profiles: Challenges and Opportunities for Detection and Prevention [0.8808993671472349]
We present a novel method for detecting fake and Large Language Model (LLM)-generated profiles in the LinkedIn Online Social Network.
We show that the suggested method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings.
(2023-07-21T19:09:24Z)
- A Cooperative Memory Network for Personalized Task-oriented Dialogue Systems with Incomplete User Profiles [55.951126447217526]
We study personalized Task-oriented Dialogue Systems without assuming that user profiles are complete.
We propose a Cooperative Memory Network (CoMemNN) that has a novel mechanism to gradually enrich user profiles.
CoMemNN is able to enrich user profiles effectively, which results in an improvement of 3.06% in terms of response selection accuracy.
(2021-02-16T18:05:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.