Automated Generation of Accurate Privacy Captions From Android Source Code Using Large Language Models
- URL: http://arxiv.org/abs/2601.06276v1
- Date: Fri, 09 Jan 2026 19:41:28 GMT
- Title: Automated Generation of Accurate Privacy Captions From Android Source Code Using Large Language Models
- Authors: Vijayanta Jain, Sepideh Ghanavati, Sai Teja Peddinti, Collin McMillan
- Abstract summary: Privacy captions are short sentences that succinctly describe what personal information is used, how it is used, and why. Inaccurate captions may mislead users and expose developers to regulatory fines. Existing approaches to generating privacy notices or just privacy captions include using questionnaires, templates, static analysis, or machine learning.
- Score: 2.286581990382935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Privacy captions are short sentences that succinctly describe what personal information is used, how it is used, and why, within an app. These captions can be utilized in various notice formats, such as privacy policies, app rationales, and app store descriptions. However, inaccurate captions may mislead users and expose developers to regulatory fines. Existing approaches to generating privacy notices or just privacy captions include using questionnaires, templates, static analysis, or machine learning. However, these approaches either rely heavily on developers' inputs and thus strain their efforts, use limited source code context, leading to the incomplete capture of app privacy behaviors, or depend on potentially inaccurate privacy policies as a source for creating notices. In this work, we address these limitations by developing Privacy Caption Generator (PCapGen), an approach that: i) automatically identifies and extracts large and precise source code context that implements privacy behaviors in an app, ii) uses a Large Language Model (LLM) to describe coarse- and fine-grained privacy behaviors, and iii) generates accurate, concise, and complete privacy captions to describe the privacy behaviors of the app. Our evaluation shows PCapGen generates concise, complete, and accurate privacy captions as compared to the baseline approach. Furthermore, privacy experts choose PCapGen captions at least 71% of the time, whereas LLMs-as-judge prefer PCapGen captions at least 76% of the time, indicating strong performance of our approach.
Related papers
- Privacy Blur: Quantifying Privacy and Utility for Image Data Release [48.64095568151945]
We show that practical implementations of Gaussian blurring are reversible enough to break privacy. We take a closer look at the privacy-utility tradeoffs offered by three other obfuscation algorithms. Pixelization and noise addition offer both privacy and utility for a number of computer vision tasks.
arXiv Detail & Related papers (2025-12-18T02:01:17Z)
- Zero-Shot Privacy-Aware Text Rewriting via Iterative Tree Search [60.197239728279534]
Large language models (LLMs) in cloud-based services have raised significant privacy concerns. Existing text anonymization and de-identification techniques, such as rule-based redaction and scrubbing, often struggle to balance privacy preservation with text naturalness and utility. We propose a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information while preserving coherence, relevance, and naturalness.
arXiv Detail & Related papers (2025-09-25T07:23:52Z)
- Which Code Statements Implement Privacy Behaviors in Android Applications? [5.723067425160506]
A "privacy behavior" in software is an action where the software uses personal information for a service or a feature, such as a website using location to provide content relevant to a user. We propose an approach to automatically detect privacy-relevant statements by fine-tuning three large language models with the data from the study.
arXiv Detail & Related papers (2025-03-03T22:20:01Z)
- Activity Recognition on Avatar-Anonymized Datasets with Masked Differential Privacy [64.32494202656801]
Privacy-preserving computer vision is an important emerging problem in machine learning and artificial intelligence. We present an anonymization pipeline that replaces sensitive human subjects in video datasets with synthetic avatars within context. We also propose MaskDP to protect non-anonymized but privacy-sensitive background information.
arXiv Detail & Related papers (2024-10-22T15:22:53Z)
- PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories. We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds. State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z)
- A New Hope: Contextual Privacy Policies for Mobile Applications and An Approach Toward Automated Generation [19.578130824867596]
The aim of contextual privacy policies (CPPs) is to fragment privacy policies into concise snippets, displaying them only within the corresponding contexts within the application's graphical user interfaces (GUIs).
In this paper, we first formulate CPPs in the mobile application scenario, and then present a novel multimodal framework, named SeePrivacy, specifically designed to automatically generate CPPs for mobile applications.
A human evaluation shows that 77% of the extracted privacy policy segments were perceived as well-aligned with the detected contexts.
arXiv Detail & Related papers (2024-02-22T13:32:33Z)
- Towards Fine-Grained Localization of Privacy Behaviors [5.74186288696419]
PriGen uses static analysis to identify Android applications' code segments that process sensitive information.
We present the initial evaluation of our translation task for 300,000 code segments.
arXiv Detail & Related papers (2023-05-24T16:32:14Z)
- PriGen: Towards Automated Translation of Android Applications' Code to Privacy Captions [4.2534846356464815]
PriGen uses static analysis to identify Android applications' code segments which process sensitive information.
We present the initial evaluation of our translation task for ~300,000 code segments.
arXiv Detail & Related papers (2023-05-11T01:14:28Z)
- PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English [77.79102359580702]
We introduce the Privacy Policy Language Understanding Evaluation benchmark, a multi-task benchmark for evaluating privacy policy language understanding.
We also collect a large corpus of privacy policies to enable privacy policy domain-specific language model pre-training.
We demonstrate that domain-specific continual pre-training offers performance improvements across all tasks.
arXiv Detail & Related papers (2022-12-20T05:58:32Z)
- SPAct: Self-supervised Privacy Preservation for Action Recognition [73.79886509500409]
Existing approaches for mitigating privacy leakage in action recognition require privacy labels along with the action labels from the video dataset.
Recent developments of self-supervised learning (SSL) have unleashed the untapped potential of the unlabeled data.
We present a novel training framework which removes privacy information from input video in a self-supervised manner without requiring privacy labels.
arXiv Detail & Related papers (2022-03-29T02:56:40Z)