Altered Histories in Version Control System Repositories: Evidence from the Trenches
- URL: http://arxiv.org/abs/2509.09294v1
- Date: Thu, 11 Sep 2025 09:34:06 GMT
- Title: Altered Histories in Version Control System Repositories: Evidence from the Trenches
- Authors: Solal Rapaport, Laurent Pautet, Samuel Tardieu, Stefano Zacchiroli
- Abstract summary: We conduct the first large-scale investigation of Git history alterations in public code repositories. We find history alterations in 1.22 M repositories, for a total of 8.7 M rewritten histories. We introduce GitHistorian, an automated tool that developers can use to spot and describe history alterations in public Git repositories.
- Score: 4.71599202491734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Version Control Systems (VCS) like Git allow developers to locally rewrite recorded history, e.g., to reorder and suppress commits or specific data in them. These alterations have legitimate use cases, but become problematic when performed on public branches that have downstream users: they break push/pull workflows, challenge the integrity and reproducibility of repositories, and create opportunities for supply chain attackers to sneak nefarious changes into them. We conduct the first large-scale investigation of Git history alterations in public code repositories. We analyze 111 M (million) repositories archived by Software Heritage, which preserves VCS histories even across alterations. We find history alterations in 1.22 M repositories, for a total of 8.7 M rewritten histories. We categorize changes by where they happen (which repositories, which branches) and what is changed in them (files or commit metadata). Conducting two targeted case studies, we show that altered histories recurrently change licenses retroactively, or are used to remove "secrets" (e.g., private keys) committed by mistake. As these behaviors correspond to bad practices (in terms of project governance or security management, respectively) that software recipients might want to avoid, we introduce GitHistorian, an automated tool that developers can use to spot and describe history alterations in public Git repositories.
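To make the notion of an "altered history" concrete, here is a minimal sketch of one common way to flag a rewrite between two archived snapshots of the same branch: the older branch head is no longer reachable from the newer one. This is an illustrative assumption, not GitHistorian's actual algorithm, which the abstract does not describe. The toy commit graph and the names `ancestors` and `is_history_altered` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy representation: commit id -> list of parent commit ids.
CommitGraph = Dict[str, List[str]]


def ancestors(graph: CommitGraph, head: str) -> Set[str]:
    """Return `head` plus every commit reachable from it via parent links."""
    seen: Set[str] = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(graph.get(commit, []))
    return seen


def is_history_altered(graph: CommitGraph, old_head: str, new_head: str) -> bool:
    """Flag a branch update as a rewrite when the old head is not an
    ancestor of the new head (i.e. the update is not a fast-forward).

    Assumption: both snapshots of the branch were archived, so commits
    dropped by the rewrite are still present in `graph`.
    """
    return old_head not in ancestors(graph, new_head)


# Toy history: a <- b <- c is the original branch; b2 and c2 are rewritten
# versions of b and c (for example after an amended commit and a force push).
toy_graph: CommitGraph = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "b2": ["a"],
    "c2": ["b2"],
}

fast_forward = is_history_altered(toy_graph, old_head="b", new_head="c")
rewrite = is_history_altered(toy_graph, old_head="c", new_head="c2")
print(f"b -> c altered: {fast_forward}")
print(f"c -> c2 altered: {rewrite}")
```

The check only works if the archive keeps commits that the rewrite made unreachable, which is the property the abstract attributes to Software Heritage. It says nothing about what changed (files or commit metadata); that would require comparing the dropped and replacement commits.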
Related papers
- Outcome-Conditioned Reasoning Distillation for Resolving Software Issues [49.16055123488827]
We present an Outcome-Conditioned Reasoning Distillation (O-CRD) framework that uses resolved in-repository issues with verified patches as supervision. Starting from a historical fix, the method reconstructs a stage-wise repair trace backward from the verified outcome. On SWE-Bench Lite, this approach increases Pass@1 by 10.4% with GPT-4o, 8.6% with DeepSeek-V3, and 10.3% with GPT-5.
arXiv Detail & Related papers (2026-01-30T18:25:39Z) - Trace: Securing Smart Contract Repository Against Access Control Vulnerability [58.02691083789239]
GitHub hosts numerous smart contract repositories containing source code, documentation, and configuration files. Third-party developers often reference, reuse, or fork code from these repositories during custom development. Existing tools for detecting smart contract vulnerabilities are limited in their ability to handle complex repositories.
arXiv Detail & Related papers (2025-10-22T05:18:28Z) - Improving Code Localization with Repository Memory [33.423769985220005]
We introduce tools that allow the agent to retrieve from a non-parametric memory encompassing recent historical commits and linked issues. We demonstrate that augmenting such a memory can significantly improve LocAgent, a state-of-the-art localization framework.
arXiv Detail & Related papers (2025-10-01T15:10:15Z) - AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans [46.56091965723774]
Fine-tuning large language models for code editing has typically relied on mining commits and pull requests. We present AgentPack, a corpus of 1.3M code edits co-authored by Claude Code, OpenAI Codex, and Cursor Agent. We show that models fine-tuned on AgentPack can outperform models trained on prior human-only commit corpora.
arXiv Detail & Related papers (2025-09-26T05:28:22Z) - ChangePrism: Visualizing the Essence of Code Changes [9.321152185934105]
We present a novel visualization approach supported by a tool named ChangePrism. The tool comprises two components: extraction, which retrieves code changes and relevant information from the git history, and visualization, which offers both general and detailed views of code changes in commits. The general view provides an overview of different types of code changes across commits, while the detailed view displays the exact changes in the source code for each commit.
arXiv Detail & Related papers (2025-08-18T06:23:34Z) - On the Prevalence and Usage of Commit Signing on GitHub: A Longitudinal and Cross-Domain Study [1.834753484317836]
We study the presence of verified commits in GitHub repositories over five years. Only 10% of all the commits in these 60 repositories are verified. We propose ways to identify commit ownership based on GitHub's Events API.
arXiv Detail & Related papers (2025-04-27T12:39:50Z) - An Empirical Study of Dotfiles Repositories Containing User-Specific Configuration Files [1.7556600627464058]
Hundreds of thousands choose to publicly host their repositories on GitHub. We collected and analyzed publicly-hosted dotfiles repositories on GitHub. We found that 25.8% of the top 500 most-starred GitHub users maintain some form of publicly accessible dotfiles repository.
arXiv Detail & Related papers (2025-01-30T18:32:46Z) - Towards Better Comprehension of Breaking Changes in the NPM Ecosystem [12.392457751450374]
We conduct a large-scale empirical study to investigate breaking changes in the NPM ecosystem.
We construct a dataset of explicitly documented breaking changes from 381 popular NPM projects.
We yield a taxonomy of JavaScript and TypeScript-specific syntactic breaking changes and a taxonomy of major types of behavioral breaking changes.
arXiv Detail & Related papers (2024-08-26T17:18:38Z) - Language Modeling with Editable External Knowledge [90.7714362827356]
This paper introduces ERASE, which improves model behavior when new documents are acquired.
It incrementally deletes or rewrites other entries in the knowledge base each time a document is added.
It improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute.
arXiv Detail & Related papers (2024-06-17T17:59:35Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration [64.19431011897515]
This paper presents Alibaba LingmaAgent, a novel Automated Software Engineering method designed to comprehensively understand and utilize whole software repositories for issue resolution. Our approach introduces a top-down method to condense critical repository information into a knowledge graph, reducing complexity, and employs a Monte Carlo tree search based strategy. In production deployment and evaluation at Alibaba Cloud, LingmaAgent automatically resolved 16.9% of in-house issues faced by development engineers, and solved 43.3% of problems after manual intervention.
arXiv Detail & Related papers (2024-06-03T15:20:06Z) - DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories [83.5195424237358]
Existing benchmarks are poorly aligned with real-world code repositories.
We propose a new benchmark named DevEval, which has three advances.
DevEval comprises 1,874 testing samples from 117 repositories, covering 10 popular domains.
arXiv Detail & Related papers (2024-05-30T09:03:42Z) - Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing [57.776971051512234]
In this work, we explore a multi-round code auto-editing setting, aiming to predict edits to a code region based on recent changes within the same.
Our model, Coeditor, is a fine-tuned language model specifically designed for code editing tasks.
In a simplified single-round, single-edit task, Coeditor significantly outperforms GPT-3.5 and SOTA open-source code completion models.
arXiv Detail & Related papers (2023-05-29T19:57:36Z)