You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
- URL: http://arxiv.org/abs/2510.03761v1
- Date: Sat, 04 Oct 2025 10:03:17 GMT
- Title: You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
- Authors: Richard A. Dubniczky, Bertalan Borsos, Norbert Tihanyi
- Abstract summary: In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. We present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We urge the research community and repository operators to take immediate action to close these hidden security gaps.
- Score: 1.0268444449457959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd
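The abstract describes a four-stage pipeline whose first stage is pattern matching over non-referenced files and LaTeX comments. A minimal, hypothetical sketch of that idea is shown below; the regexes, helper names, and sample input are illustrative assumptions, not the authors' actual rules (those are in the project repository linked above).

```python
import re

# Illustrative secret patterns only (assumed for this sketch); a real
# scanner would use a much larger, curated rule set.
SECRET_PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "github_token": re.compile(r"ghp_[0-9A-Za-z]{36}"),
    "drive_link": re.compile(r"https://drive\.google\.com/\S+"),
}

def extract_comments(latex_source: str) -> list[str]:
    """Return LaTeX comment bodies, skipping escaped percent signs (\\%)."""
    comments = []
    for line in latex_source.splitlines():
        # Match the first % that is not preceded by a backslash.
        m = re.search(r"(?<!\\)%(.*)", line)
        if m:
            comments.append(m.group(1).strip())
    return comments

def scan_for_secrets(latex_source: str) -> list[tuple[str, str]]:
    """Run every pattern over every extracted comment."""
    hits = []
    for comment in extract_comments(latex_source):
        for name, pattern in SECRET_PATTERNS.items():
            for match in pattern.finditer(comment):
                hits.append((name, match.group(0)))
    return hits

# A fabricated example source; the token below is not a real credential.
sample = r"""
\section{Results}  % leftover token: ghp_0123456789abcdefghijklmnopqrstuvwxyz
Accuracy reaches 95\% on the benchmark.
"""
```

In the full framework this stage would only produce candidates; the later filtering, harvesting, and LLM stages described in the abstract decide which candidates are genuine disclosures.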
Related papers
- Why Authors and Maintainers Link (or Don't Link) Their PyPI Libraries to Code Repositories and Donation Platforms [83.16077040470975]
Metadata of libraries on the Python Package Index (PyPI) plays a critical role in supporting the transparency, trust, and sustainability of open-source libraries. This paper presents a large-scale empirical study combining two targeted surveys sent to 50,000 PyPI authors and maintainers. We analyze more than 1,400 responses using large language model (LLM)-based topic modeling to uncover key motivations and barriers related to linking repositories and donation platforms.
arXiv Detail & Related papers (2026-01-21T16:13:57Z) - Multi-Agent Taint Specification Extraction for Vulnerability Detection [49.27772068704498]
Static Application Security Testing (SAST) tools using taint analysis are widely viewed as providing higher-quality vulnerability detection results. We present SemTaint, a multi-agent system that strategically combines the semantic understanding of Large Language Models (LLMs) with traditional static program analysis. We integrate SemTaint with CodeQL, a state-of-the-art SAST tool, and demonstrate its effectiveness by detecting 106 of 162 vulnerabilities previously undetectable by CodeQL.
arXiv Detail & Related papers (2026-01-15T21:31:51Z) - Evaluating Large Language Models in detecting Secrets in Android Apps [11.963737068221436]
Mobile apps often embed authentication secrets, such as API keys, tokens, and client IDs, to integrate with cloud services. Developers often hardcode these credentials into Android apps, exposing them to extraction through reverse engineering. We propose SecretLoc, an LLM-based approach for detecting hardcoded secrets in Android apps.
arXiv Detail & Related papers (2025-10-21T12:59:39Z) - Executable Knowledge Graphs for Replicating AI Research [65.41207324831583]
Executable Knowledge Graphs (xKG) is a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. Code will be released at https://github.com/zjunlp/xKG.
arXiv Detail & Related papers (2025-10-20T17:53:23Z) - ISACL: Internal State Analyzer for Copyrighted Training Data Leakage [28.435965753598875]
Large Language Models (LLMs) pose risks of inadvertently exposing copyrighted or proprietary data. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements.
arXiv Detail & Related papers (2025-08-25T08:04:20Z) - Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers [61.57691030102618]
We propose a novel jailbreaking method, Paper Summary Attack (PSA). It synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template. Experiments show significant vulnerabilities not only in base LLMs but also in state-of-the-art reasoning models such as DeepSeek-R1.
arXiv Detail & Related papers (2025-07-17T18:33:50Z) - BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks [57.589795399265945]
We introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We also introduce BigDocs-Bench, a benchmark suite with 10 novel tasks. Our experiments show that training with BigDocs improves average performance by up to 25.8% over closed-source GPT-4o.
arXiv Detail & Related papers (2024-12-05T21:41:20Z) - Secret Breach Prevention in Software Issue Reports [4.177725820146491]
The accidental exposure of sensitive information is a growing security threat. This study addresses it through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets.
arXiv Detail & Related papers (2024-10-31T06:14:17Z) - LLMDet: A Third Party Large Language Models Generated Text Detection Tool [119.0952092533317]
Text generated by large language models (LLMs) is remarkably close to high-quality human-authored text.
Existing detection tools can only differentiate between machine-generated and human-authored text.
We propose LLMDet, a model-specific, secure, efficient, and extendable detection tool.
arXiv Detail & Related papers (2023-05-24T10:45:16Z) - ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain [0.0]
This paper proposes a new focused crawler called ThreatCrawl. It uses BERT-based models to classify documents and adapt its crawling path dynamically. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
arXiv Detail & Related papers (2023-04-24T09:53:33Z) - Automatic Analysis of Available Source Code of Top Artificial Intelligence Conference Papers [9.498078340492087]
We propose a method to automatically identify papers with available source code and extract their source code repository URLs.
We find that 20.5% of regular papers of 10 top AI conferences published from 2010 to 2019 are identified as papers with available source code.
A large-scale comprehensive statistical analysis is made for a general picture of the source code of AI conference papers.
arXiv Detail & Related papers (2022-09-28T15:05:58Z) - Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis test guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
arXiv Detail & Related papers (2020-10-12T16:16:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.