Hiding in Plain Sight: Query Obfuscation via Random Multilingual Searches
- URL: http://arxiv.org/abs/2506.04963v1
- Date: Thu, 05 Jun 2025 12:38:08 GMT
- Title: Hiding in Plain Sight: Query Obfuscation via Random Multilingual Searches
- Authors: Anton Firc, Jan Klusáček, Kamil Malinka
- Abstract summary: While personalization can enhance relevance, it introduces privacy risks and can lead to filter bubbles. This paper proposes and evaluates a lightweight, client-side query obfuscation strategy using randomly generated multilingual search queries. Our findings show that while displayed search results remain largely stable, the search engine's identified user interests shift significantly under obfuscation.
- Score: 1.3654846342364308
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern search engines extensively personalize results by building detailed user profiles based on query history and behaviour. While personalization can enhance relevance, it introduces privacy risks and can lead to filter bubbles. This paper proposes and evaluates a lightweight, client-side query obfuscation strategy using randomly generated multilingual search queries to disrupt user profiling. Through controlled experiments on the Seznam.cz search engine, we assess the impact of interleaving real queries with obfuscating noise in various language configurations and ratios. Our findings show that while displayed search results remain largely stable, the search engine's identified user interests shift significantly under obfuscation. We further demonstrate that such random queries can prevent accurate profiling and overwrite established user profiles. This study provides practical evidence for query obfuscation as a viable privacy-preserving mechanism and introduces a tool that enables users to autonomously protect their search behaviour without modifying existing infrastructure.
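The interleaving idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' tool: the function names (`random_query`, `interleave`), the tiny word pools, and the `noise_per_real` parameter are all hypothetical, chosen only to show how real queries might be mixed with random multilingual noise at a fixed ratio. The sketch builds the query stream only and does not send anything to a search engine.

```python
import random

# Hypothetical per-language word pools (assumption for illustration;
# the vocabularies used in the paper are not given in this listing).
WORD_POOLS = {
    "en": ["garden", "train", "recipe", "guitar", "weather", "museum"],
    "de": ["Garten", "Zug", "Rezept", "Gitarre", "Wetter", "Museum"],
    "cs": ["zahrada", "vlak", "recept", "kytara", "počasí", "muzeum"],
}


def random_query(rng, languages, min_words=1, max_words=3):
    """Build one noise query from random words of one randomly chosen language."""
    lang = rng.choice(languages)
    n_words = rng.randint(min_words, max_words)
    return " ".join(rng.sample(WORD_POOLS[lang], n_words))


def interleave(real_queries, noise_per_real, languages, seed=None):
    """Return a list of (kind, query) pairs mixing real and noise queries.

    Each real query is grouped with `noise_per_real` noise queries, and the
    group is shuffled so the real query's position within it is not fixed.
    """
    rng = random.Random(seed)
    stream = []
    for query in real_queries:
        batch = [("real", query)]
        batch += [("noise", random_query(rng, languages))
                  for _ in range(noise_per_real)]
        rng.shuffle(batch)
        stream.extend(batch)
    return stream


if __name__ == "__main__":
    # Two real queries, three noise queries each, noise drawn from English and Czech.
    for kind, query in interleave(["cheap flights prague", "python tutorial"],
                                  noise_per_real=3, languages=["en", "cs"], seed=0):
        print(f"{kind:5s} {query}")
```

Varying `noise_per_real` and `languages` in this sketch corresponds loosely to the ratios and language configurations the abstract mentions; the specific values tested in the paper are not stated here.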
Related papers
- Controlling What You Share: Assessing Language Model Adherence to Privacy Preferences [80.63946798650653]
We explore how users can stay in control of their data by using privacy profiles.
We build a framework where a local model uses these instructions to rewrite queries.
To support this research, we introduce a multilingual dataset of real user queries to mark private content.
arXiv Detail & Related papers (2025-07-07T18:22:55Z)
- Distortion Search, A Web Search Privacy Heuristic [0.0]
Search engines have vast technical capabilities to retain Internet search logs for each user.
Many web search privacy enhancing tools require that the user trusts a third party.
We suggest Distortion Search, a user-centric web search query privacy methodology.
arXiv Detail & Related papers (2025-06-10T01:35:16Z)
- Query Smarter, Trust Better? Exploring Search Behaviours for Verifying News Accuracy [35.07647423247397]
This study explores how different query generation strategies affect news verification and whether the way people search influences the accuracy of their information evaluation.
The results show that search behaviour significantly affects trust in news, with successful searches involving multiple queries yielding higher-quality results.
Although 'Boost' interventions had limited impact, the study suggests that interface design encouraging users to thoroughly review search results can enhance query formulation.
arXiv Detail & Related papers (2025-04-07T14:50:13Z)
- Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance.
We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- The language of sound search: Examining User Queries in Audio Search Engines [0.2455468619225742]
Research inadequately addresses real-world user needs and behaviours in designing text-based audio retrieval systems.
To bridge this gap, we analysed search queries from two sources: a custom survey and Freesound website query logs.
Our findings indicate that survey queries are generally longer than Freesound queries, suggesting users prefer detailed queries when not limited by system constraints.
arXiv Detail & Related papers (2024-10-10T19:24:13Z)
- Improving Retrieval in Sponsored Search by Leveraging Query Context Signals [6.152499434499752]
We propose an approach to enhance query understanding by augmenting queries with rich contextual signals.
We use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations.
Our context-aware approach substantially outperforms context-free models.
arXiv Detail & Related papers (2024-07-19T14:28:53Z)
- AutoBencher: Towards Declarative Benchmark Construction [74.54640925146289]
We use AutoBencher to create datasets for math, multilinguality, knowledge, and safety.
The scalability of AutoBencher allows it to test fine-grained categories of knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks.
arXiv Detail & Related papers (2024-07-11T10:03:47Z)
- Words Blending Boxes. Obfuscating Queries in Information Retrieval using Differential Privacy [7.831978389504435]
When an Information Retrieval System (IRS) does not protect the privacy of its users, sensitive information may be disclosed through the queries sent to the system.
Recent improvements, especially in NLP, have shown the potential of using Differential Privacy to obfuscate texts.
We propose Word Blending Boxes, a novel differentially private mechanism for query obfuscation.
arXiv Detail & Related papers (2024-05-15T12:51:36Z)
- Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training [55.321010757641524]
A primary concern regarding training large language models (LLMs) is whether they abuse copyrighted online text.
We propose an alternative insert-and-detect methodology, advocating that web users and content platforms employ unique identifiers for reliable and independent membership inference.
arXiv Detail & Related papers (2024-03-23T06:36:32Z)
- Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- Semantics-Preserved Distortion for Personal Privacy Protection in Information Management [65.08939490413037]
This paper suggests a linguistically-grounded approach to distort texts while maintaining semantic integrity.
We present two distinct frameworks for semantic-preserving distortion: a generative approach and a substitutive approach.
We also explore privacy protection in a specific medical information management scenario, showing our method effectively limits sensitive data memorization.
arXiv Detail & Related papers (2022-01-04T04:01:05Z)
- Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)
- Improving Query Safety at Pinterest [46.57632646205479]
PinSets is a system for query-set expansion.
It applies a simple yet powerful mechanism to search user sessions.
It expands a tiny seed set into thousands of related queries at nearly perfect precision.
arXiv Detail & Related papers (2020-06-20T07:35:22Z)
- Query Intent Detection from the SEO Perspective [0.34376560669160383]
We aim to identify the user query's intent by taking advantage of Google results and machine learning methods.
A list of keywords extracted from the clustered queries is used to identify the intent of a new given query.
arXiv Detail & Related papers (2020-06-16T13:08:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.