Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps
- URL: http://arxiv.org/abs/2411.17425v1
- Date: Tue, 26 Nov 2024 13:31:51 GMT
- Title: Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps
- Authors: Xue Xia, Randall Balestriero, Tao Zhang, Lorenz Hurni,
- Abstract summary: We propose a novel approach that combines segmentation and association of geographic entities in historical maps using video instance segmentation (VIS)
To mitigate this challenge, we explore self-supervised learning (SSL) techniques to enhance VIS performance on historical maps.
- Score: 16.35356981558991
- License:
- Abstract: Tracking geographic entities from historical maps, such as buildings, offers valuable insights into cultural heritage, urbanization patterns, environmental changes, and various historical research endeavors. However, linking these entities across diverse maps remains a persistent challenge for researchers. Traditionally, this has been addressed through a two-step process: detecting entities within individual maps and then associating them via a heuristic-based post-processing step. In this paper, we propose a novel approach that combines segmentation and association of geographic entities in historical maps using video instance segmentation (VIS). This method significantly streamlines geographic entity alignment and enhances automation. However, acquiring high-quality, video-format training data for VIS models is prohibitively expensive, especially for historical maps that often contain hundreds or thousands of geographic entities. To mitigate this challenge, we explore self-supervised learning (SSL) techniques to enhance VIS performance on historical maps. We evaluate the performance of VIS models under different pretraining configurations and introduce a novel method for generating synthetic videos from unlabeled historical map images for pretraining. Our proposed self-supervised VIS method substantially reduces the need for manual annotation. Experimental results demonstrate the superiority of the proposed self-supervised VIS approach, achieving a 24.9\% improvement in AP and a 0.23 increase in F1 score compared to the model trained from scratch.
Related papers
- Semantic Segmentation for Sequential Historical Maps by Learning from Only One Map [0.4915744683251151]
We propose an automated approach to digitization using deep-learning-based semantic segmentation.
A key challenge in this process is the lack of ground-truth annotations required for training deep neural networks.
We introduce a weakly-supervised age-tracing strategy for model fine-tuning.
arXiv Detail & Related papers (2025-01-03T14:55:22Z) - MapSAM: Adapting Segment Anything Model for Automated Feature Detection in Historical Maps [6.414068793245697]
We introduce MapSAM, a parameter-efficient fine-tuning strategy that adapts SAM into a prompt-free and versatile solution for historical map segmentation tasks.
Specifically, we employ Weight-Decomposed Low-Rank Adaptation (DoRA) to integrate domain-specific knowledge into the image encoder.
We develop an automatic prompt generation process, eliminating the need for manual input.
arXiv Detail & Related papers (2024-11-11T13:18:45Z) - TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation [48.75470418596875]
Training on large-scale datasets can boost the performance of video instance segmentation while the datasets for VIS are hard to scale up due to the high labor cost.
What we possess are numerous isolated filed-specific datasets, thus, it is appealing to jointly train models across the aggregation of datasets to enhance data volume and diversity.
We conduct extensive evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO.
Our model shows significant improvement over the baseline solutions, and sets new state-of-the-art records on all benchmarks.
arXiv Detail & Related papers (2023-12-11T18:50:09Z) - Weakly-Supervised Action Localization by Hierarchically-structured
Latent Attention Modeling [19.683714649646603]
Weakly-supervised action localization aims to recognize and localize action instancese in untrimmed videos with only video-level labels.
Most existing models rely on multiple instance learning(MIL), where predictions of unlabeled instances are supervised by classifying labeled bags.
We propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics.
arXiv Detail & Related papers (2023-08-19T08:45:49Z) - TAPIR: Tracking Any Point with per-frame Initialization and temporal
Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Cross-view Geo-localization via Learning Disentangled Geometric Layout
Correspondence [11.823147814005411]
Cross-view geo-localization aims to estimate the location of a query ground image by matching it to a reference geo-tagged aerial images database.
Recent works achieve outstanding progress on cross-view geo-localization benchmarks.
However, existing methods still suffer from poor performance on the cross-area benchmarks.
arXiv Detail & Related papers (2022-12-08T04:54:01Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - Neural networks for semantic segmentation of historical city maps:
Cross-cultural performance and the impact of figurative diversity [0.0]
We present a new semantic segmentation model for historical city maps based on convolutional neural networks.
We show that these networks are able to semantically segment map data of a very large figurative diversity with efficiency.
arXiv Detail & Related papers (2021-01-29T09:08:12Z) - Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences
for Urban Scene Segmentation [57.68890534164427]
In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data.
Our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks.
arXiv Detail & Related papers (2020-05-20T18:00:05Z) - A Transfer Learning approach to Heatmap Regression for Action Unit
intensity estimation [50.261472059743845]
Action Units (AUs) are geometrically-based atomic facial muscle movements.
We propose a novel AU modelling problem that consists of jointly estimating their localisation and intensity.
A Heatmap models whether an AU occurs or not at a given spatial location.
arXiv Detail & Related papers (2020-04-14T16:51:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.