Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding
- URL: http://arxiv.org/abs/2603.01324v1
- Date: Sun, 01 Mar 2026 23:50:08 GMT
- Title: Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding
- Authors: Anna Michailidou, Georgios Angelidis, Vasileios Argyriou, Panagiotis Sarigiannidis, Georgios Th. Papadopoulos
- Abstract summary: We present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding. We focus on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. The most notable finding across all evaluated benchmarks is that supervised training remains the most reliable approach.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aerial imagery is critical for large-scale post-disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative by reducing dependence on fixed label sets and extensive task-specific annotations. Instead, they leverage large-scale pretraining and vision-language representations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between different learning paradigms, providing insight into their applicability for real-world disaster response. The most notable finding across all evaluated benchmarks is that supervised training remains the most reliable approach (i.e., when the label space is fixed and annotations are available), especially for small objects and fine boundary delineation in cluttered scenes.
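The semantic-segmentation comparison described in the abstract is conventionally scored with per-class intersection-over-union (IoU) and its mean over classes (mIoU). The following is a minimal sketch of that metric, assuming integer-labeled prediction and ground-truth masks; the toy masks and the "flooded area" class name are illustrative, not taken from the paper:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Compute IoU for each class from integer label masks of equal shape."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        inter = np.logical_and(pred_c, gt_c).sum()
        union = np.logical_or(pred_c, gt_c).sum()
        # A class absent from both masks contributes NaN, so it can be
        # excluded from the mean rather than counted as a perfect score.
        ious.append(inter / union if union > 0 else np.nan)
    return np.array(ious)

# Toy 2x3 masks with two classes (0 = background, 1 = flooded area)
gt = np.array([[0, 1, 1],
               [0, 0, 1]])
pred = np.array([[0, 1, 0],
                 [0, 1, 1]])

ious = per_class_iou(pred, gt, num_classes=2)
miou = np.nanmean(ious)  # mean over classes present in either mask
```

In benchmark practice the union and intersection counts are usually accumulated over the whole test set before dividing, rather than averaged per image as this per-pair sketch does.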
Related papers
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models
Existing adversarial training methods for robust fine-tuning largely overlook the role of language in enhancing visual robustness. We propose Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which leverages high-quality captions during training to guide adversarial examples away from diverse semantics present in images. QT-AFT achieves state-of-the-art zero-shot adversarial robustness and clean accuracy, evaluated across 16 zero-shot datasets.
arXiv Detail & Related papers (2025-07-22T06:13:30Z) - Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models [0.65268245109828]
We introduce the notion of contextual diversity for active learning (CDAL).
We propose a data repair algorithm to curate contextually fair data to reduce model bias.
We are working on developing an image retrieval system for wildlife camera-trap images and a reliable warning system for poor-quality rural roads.
arXiv Detail & Related papers (2024-11-04T09:43:33Z) - PEEKABOO: Hiding parts of an image for unsupervised object localization [7.161489957025654]
Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual information.
We propose a single-stage learning framework, dubbed PEEKABOO, for unsupervised object localization.
The key idea is to selectively hide parts of an image and leverage the remaining image information to infer the location of objects without explicit supervision.
arXiv Detail & Related papers (2024-07-24T20:35:20Z) - Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z) - Aligning and Prompting Everything All at Once for Universal Visual
Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - Vocabulary-informed Zero-shot and Open-set Learning [128.83517181045815]
We propose vocabulary-informed learning to address problems of supervised, zero-shot, generalized zero-shot and open set recognition.
Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms.
We illustrate that the resulting model shows improvements in supervised, zero-shot, generalized zero-shot, and large open-set recognition, with up to a 310K-class vocabulary on the Animals with Attributes and ImageNet datasets.
arXiv Detail & Related papers (2023-01-03T08:19:22Z) - On Guiding Visual Attention with Language Specification [76.08326100891571]
We use high-level language specification as advice for constraining the classification evidence to task-relevant features, instead of distractors.
We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data.
arXiv Detail & Related papers (2022-02-17T22:40:19Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Unsupervised and Interpretable Domain Adaptation to Rapidly Filter
Tweets for Emergency Services [18.57009530004948]
We present a novel method to classify relevant tweets during an ongoing crisis using the publicly available dataset of TREC incident streams.
We use dedicated attention layers for each task to provide model interpretability, which is critical for real-world applications.
We show a practical implication of our work by providing a use-case for the COVID-19 pandemic.
arXiv Detail & Related papers (2020-03-04T06:40:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.