EVOSCAT: Exploring Software Change Dynamics in Large-Scale Historical Datasets
- URL: http://arxiv.org/abs/2508.10852v1
- Date: Thu, 14 Aug 2025 17:20:27 GMT
- Title: EVOSCAT: Exploring Software Change Dynamics in Large-Scale Historical Datasets
- Authors: Souhaila Serbout, Diana Carolina Muñoz Hurtado, Hassan Atwi, Edoardo Riggio, Cesare Pautasso
- Abstract summary: Long-lived software projects encompass a large number of artifacts, which undergo many revisions throughout their history. EvoScat aims to provide researchers with a means to produce scalable visualizations that can help them explore and characterize evolution datasets. The paper shows how the tool can be tailored to specific analysis needs thanks to its support for flexible configuration of history scaling and alignment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-lived software projects encompass a large number of artifacts, which undergo many revisions throughout their history. Empirical software engineering researchers studying software evolution gather datasets with millions of events, representing changes introduced to specific artifacts. In this paper, we propose EvoScat, a tool that addresses temporal scalability by using an interactive density scatterplot to provide a global overview of large historical datasets mined from open source repositories in a single visualization. EvoScat intends to provide researchers with a means to produce scalable visualizations that help them explore and characterize evolution datasets, as well as compare the histories of individual artifacts, both in terms of 1) observing how rapidly different artifacts age over multiple-year-long time spans and 2) how often the metrics associated with each artifact tend towards improvement or worsening. The paper shows how the tool can be tailored to specific analysis needs (pace-of-change comparison, clone detection, freshness assessment) thanks to its support for flexible configuration of history scaling and alignment along the time axis, artifact sorting, and interactive color mapping, enabling the analysis of millions of events obtained by mining the histories of tens of thousands of software artifacts. We include in this paper a gallery showcasing datasets gathering specific artifacts (OpenAPI descriptions, GitHub workflow definitions) across multiple repositories, as well as diving into the history of specific popular open source projects.
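As a rough illustration of the density-scatterplot idea described in the abstract (this is not EvoScat's actual implementation; the event format, field names, and bin counts below are hypothetical), the core step is binning (timestamp, artifact) change events into a 2D grid whose per-cell counts drive the color mapping:

```python
def density_grid(events, t_min, t_max, n_time_bins, n_artifacts):
    """Bin (timestamp, artifact_index) change events into a 2D density grid.

    Each cell counts how many change events fall into one time slice for one
    artifact row; a plotting layer would then map these counts to colors.
    """
    width = (t_max - t_min) / n_time_bins
    grid = [[0] * n_time_bins for _ in range(n_artifacts)]
    for t, artifact in events:
        # Skip events outside the visible time span or artifact range.
        if not (t_min <= t < t_max) or not (0 <= artifact < n_artifacts):
            continue
        col = int((t - t_min) / width)
        grid[artifact][col] += 1
    return grid

# Example: 3 artifacts, change events over a 10-unit time span, 5 time bins.
events = [(0.5, 0), (1.2, 0), (1.4, 0), (7.9, 1), (3.0, 2), (9.9, 2)]
grid = density_grid(events, 0.0, 10.0, n_time_bins=5, n_artifacts=3)
```

Sorting the artifact rows and shifting each row's time origin (e.g. aligning all histories to start at each artifact's first commit) correspond to the history scaling and alignment options the abstract mentions.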
Related papers
- PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization [82.96200364977737]
We introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks.
PlotCraft is structured around seven high-level visualization tasks and encompasses 48 distinct chart types.
It is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities.
arXiv Detail & Related papers (2025-10-15T10:14:39Z) - PyPotteryLens: An Open-Source Deep Learning Framework for Automated Digitisation of Archaeological Pottery Documentation [0.0]
PyPotteryLens is a framework that automates the digitisation and processing of archaeological pottery drawings from published sources.
The framework achieves over 97% precision and recall in pottery detection and classification tasks.
It reduces processing time by up to 5x to 20x compared to manual methods.
arXiv Detail & Related papers (2024-12-16T09:01:32Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for diverse tasks, including biology, medicine, and commonsense question-answering (QA).
Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - Replication: Contrastive Learning and Data Augmentation in Traffic Classification Using a Flowpic Input Representation [47.95762911696397]
We reproduce [16] on the same datasets and replicate its most salient aspect (the importance of data augmentation) on three additional public datasets.
While we confirm most of the original results, we also found a 20% accuracy drop on some of the investigated scenarios due to a data shift in the original dataset.
arXiv Detail & Related papers (2023-09-18T12:55:09Z) - DeepVATS: Deep Visual Analytics for Time Series [7.822594828788055]
We present DeepVATS, an open-source tool that brings the field of Deep Visual Analytics into time series data.
DeepVATS trains, in a self-supervised way, a masked time series autoencoder that reconstructs patches of a time series.
We report on results that validate the utility of DeepVATS, running experiments on both synthetic and real datasets.
arXiv Detail & Related papers (2023-02-08T03:26:50Z) - A Framework for Large Scale Synthetic Graph Dataset Generation [2.248608623448951]
This work proposes a scalable synthetic graph generation tool to scale the datasets to production-size graphs.
The tool learns a series of parametric models from proprietary datasets that can be released to researchers.
We demonstrate the generalizability of the framework across a series of datasets.
arXiv Detail & Related papers (2022-10-04T22:41:33Z) - Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z) - REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to support the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z) - Robust Image Retrieval-based Visual Localization using Kapture [10.249293519246478]
We present a versatile pipeline for visual localization that facilitates the use of different local and global features.
We evaluate our methods on eight public datasets, where they rank among the top on all of them and first on many.
To foster future research, we release code, models, and all datasets used in this paper in the kapture format open source under a permissive BSD license.
arXiv Detail & Related papers (2020-07-27T21:10:35Z) - TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
The Tracking Any Object (TAO) dataset consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z) - Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers [2.5899040911480187]
We introduce a multimodal approach for the semantic segmentation of historical newspapers.
Based on experiments on diachronic Swiss and Luxembourgish newspapers, we investigate the predictive power of visual and textual features.
Results show consistent improvement of multimodal models in comparison to a strong visual baseline.
arXiv Detail & Related papers (2020-02-14T17:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.