A Reproducible Workflow for Scraping, Structuring, and Segmenting Legacy Archaeological Artifact Images
- URL: http://arxiv.org/abs/2512.11817v1
- Date: Thu, 27 Nov 2025 14:29:05 GMT
- Title: A Reproducible Workflow for Scraping, Structuring, and Segmenting Legacy Archaeological Artifact Images
- Authors: Juan Palomeque-Gonzalez
- Abstract summary: The case study focuses on the Lower Palaeolithic hand axe and biface collection curated by the Archaeology Data Service (ADS). To address the lack of bulk-download support, two open-source tools were developed: a web scraping script that retrieves all record pages, extracts associated metadata, and downloads the available images while respecting ADS Terms of Use and ethical scraping guidelines. The original images are not redistributed; only derived products such as masks, outlines, and annotations are shared.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This technical note presents a reproducible workflow for converting a legacy archaeological image collection into a structured, segmentation-ready dataset. The case study focuses on the Lower Palaeolithic hand axe and biface collection curated by the Archaeology Data Service (ADS), a dataset that provides thousands of standardised photographs but no mechanism for bulk download or automated processing. To address this, two open-source tools were developed: a web scraping script that retrieves all record pages, extracts associated metadata, and downloads the available images while respecting ADS Terms of Use and ethical scraping guidelines; and an image processing pipeline that renames files using UUIDs, generates binary masks and bounding boxes through classical computer vision, and stores all derived information in a COCO-compatible JSON file enriched with archaeological metadata. The original images are not redistributed; only derived products such as masks, outlines, and annotations are shared. Together, these components provide a lightweight and reusable approach for transforming web-based archaeological image collections into machine-learning-friendly formats, facilitating downstream analysis and contributing to more reproducible research practices in digital archaeology.
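The ethical-scraping behaviour described in the abstract (retrieving record pages politely, with rate limiting) can be sketched as below. This is a minimal illustration, not the authors' released script: the `record_url` pattern is hypothetical, and the minimum-delay value is an assumption.

```python
import time
from urllib.parse import urljoin

# Illustrative base URL; the real ADS record-page layout may differ.
BASE = "https://archaeologydataservice.ac.uk/"

def record_url(record_id):
    # Hypothetical record-page URL pattern, for illustration only.
    return urljoin(BASE, f"archives/view/{record_id}/")

class Throttle:
    """Enforce a minimum delay between consecutive requests,
    a common ingredient of ethical scraping guidelines."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.last = None

    def wait(self):
        """Block until at least `min_delay` seconds since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_delay - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

A scraping loop would call `throttle.wait()` before each `urllib.request.urlopen(record_url(i))`, and should additionally consult `robots.txt` (e.g. via `urllib.robotparser`) before fetching.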
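The second tool's core steps (UUID renaming, classical-CV binary masks, bounding boxes, COCO-compatible JSON) can likewise be sketched. This is a simplified stand-in, assuming a plain intensity threshold separates a dark artifact from a light standardised background; the released pipeline may use different classical-CV operations.

```python
import json
import uuid
import numpy as np

def mask_and_bbox(gray, threshold=200):
    """Binary mask via simple thresholding (artifact darker than background),
    plus a COCO-style [x, y, width, height] bounding box."""
    mask = (gray < threshold).astype(np.uint8)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return mask, None
    x0, y0 = int(xs.min()), int(ys.min())
    w = int(xs.max()) - x0 + 1
    h = int(ys.max()) - y0 + 1
    return mask, [x0, y0, w, h]

def coco_record(gray, metadata=None):
    """One COCO-compatible image/annotation pair, keyed by a fresh UUID
    (the UUID also serves as the renamed file name)."""
    uid = uuid.uuid4().hex
    mask, bbox = mask_and_bbox(gray)
    image = {"id": uid, "file_name": f"{uid}.jpg",
             "width": gray.shape[1], "height": gray.shape[0],
             "metadata": metadata or {}}          # archaeological metadata slot
    annotation = {"id": uid, "image_id": uid, "category_id": 1,
                  "bbox": bbox, "area": int(mask.sum()), "iscrowd": 0}
    return image, annotation

# Synthetic example: a dark 'artifact' region on a light background.
img = np.full((100, 120), 240, dtype=np.uint8)
img[20:80, 30:90] = 50
image, ann = coco_record(img, {"site": "example"})

# A full run would aggregate many such records into one COCO file:
coco = {"images": [image], "annotations": [ann],
        "categories": [{"id": 1, "name": "biface"}]}
json.dumps(coco)  # serialisable as-is
```

Only the derived mask, bounding box, and JSON record would be shared, consistent with the note's policy of not redistributing the original images.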
Related papers
- A Dataset for Named Entity Recognition and Relation Extraction from Art-historical Image Descriptions [0.379152625956354]
FRAME is a manually annotated dataset of art-historical image descriptions for Named Entity Recognition (NER) and Relation Extraction (RE). Descriptions were collected from museum catalogs, auction listings, open-access platforms, and scholarly databases. The dataset is released as UIMA XMI Common Analysis Structure (CAS) files with accompanying images and metadata, and can be used to benchmark and fine-tune NER and RE systems.
arXiv Detail & Related papers (2026-02-22T11:29:03Z)
- Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z)
- Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection [132.63712430690856]
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. Data augmentation and generative methods have shown promise in few-shot learning, but their effectiveness for CD-FSOD remains unclear. We propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD.
arXiv Detail & Related papers (2025-06-06T08:41:09Z)
- RAWMamba: Unified sRGB-to-RAW De-rendering With State Space Model [52.250939617273744]
We propose RAWMamba, a Mamba-based unified framework for sRGB-to-RAW de-rendering.
The core of RAWMamba is the Unified Metadata Embedding (UME) module, which harmonizes diverse metadata types into a unified representation.
The Local Tone-Aware Mamba module captures long-range dependencies to enable effective global propagation of metadata.
arXiv Detail & Related papers (2024-11-18T16:45:44Z)
- In-Context LoRA for Diffusion Transformers [49.288489286276146]
We show that text-to-image DiTs can effectively perform in-context generation without any tuning.
We name our models In-Context LoRA (IC-LoRA)
Our pipeline generates high-fidelity image sets that better adhere to prompts.
arXiv Detail & Related papers (2024-10-31T09:45:00Z)
- MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation [54.64194935409982]
We introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer-wise RGBA decompositions.
MuLAn is the first photorealistic resource providing instance decomposition and spatial information for high quality images.
We aim to encourage the development of novel generation and editing technology, in particular layer-wise solutions.
arXiv Detail & Related papers (2024-04-03T14:58:00Z)
- A Multimodal Approach for Cross-Domain Image Retrieval [5.5547914920738]
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision.
This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models.
Our method, dubbed as Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation.
arXiv Detail & Related papers (2024-03-22T12:08:16Z)
- AutArch: An AI-assisted workflow for object detection and automated recording in archaeological catalogues [35.253552063074366]
This paper introduces a new workflow for collecting data from archaeological find catalogues available as legacy resources. The workflow relies on custom software (AutArch) supporting image processing, object detection, and interactive means of validating and adjusting automatically retrieved data. We integrate artificial intelligence (AI) in terms of neural networks for object detection and classification into the workflow.
arXiv Detail & Related papers (2023-11-29T17:24:04Z)
- Automatic Recognition of Learning Resource Category in a Digital Library [6.865460045260549]
We introduce the Heterogeneous Learning Resources (HLR) dataset designed for document image classification.
The approach involves decomposing individual learning resources into constituent document images (sheets)
These images are then processed through an OCR tool to extract textual representation.
arXiv Detail & Related papers (2023-11-28T07:48:18Z)
- ObjFormer: Learning Land-Cover Changes From Paired OSM Data and Optical High-Resolution Imagery via Object-Guided Transformer [31.46969412692045]
This paper pioneers the direct detection of land-cover changes utilizing paired OSM data and optical imagery.
We propose an object-guided Transformer (ObjFormer) by naturally combining the object-based image analysis (OBIA) technique with the advanced vision Transformer architecture.
A large-scale benchmark dataset called OpenMapCD is constructed to conduct detailed experiments.
arXiv Detail & Related papers (2023-10-04T09:26:44Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- ArcAid: Analysis of Archaeological Artifacts using Drawings [23.906975910478142]
Archaeology is an intriguing domain for computer vision.
It suffers not only from a shortage of (labeled) data, but also from highly challenging data, which is often extremely abraded and damaged.
This paper proposes a novel semi-supervised model for classification and retrieval of images of archaeological artifacts.
arXiv Detail & Related papers (2022-11-17T11:57:01Z)
- Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images.
We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image.
We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.