Habitat Classification from Ground-Level Imagery Using Deep Neural Networks
- URL: http://arxiv.org/abs/2507.04017v1
- Date: Sat, 05 Jul 2025 12:07:13 GMT
- Title: Habitat Classification from Ground-Level Imagery Using Deep Neural Networks
- Authors: Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August, Claire M Wood, Lan Qie, Petra Bosilj, James M Brown
- Abstract summary: This study applies state-of-the-art deep neural network architectures to ground-level habitat imagery. We evaluate two families of models -- convolutional neural networks (CNNs) and vision transformers (ViTs). Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics.
- Score: 1.3408365072149797
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Habitat assessment at local scales -- critical for enhancing biodiversity and guiding conservation priorities -- often relies on expert field surveys that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. While most AI-driven habitat mapping depends on remote sensing, it is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above and remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models -- convolutional neural networks (CNNs) and vision transformers (ViTs) -- under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space. Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at the national scale.
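Since the abstract credits the gains on visually similar habitats to supervised contrastive learning, a minimal PyTorch sketch of such an objective (in the spirit of the SupCon loss; names, shapes, and the temperature value are illustrative, not taken from the paper's code) may make the idea concrete:

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss: pull same-label embeddings together,
    push different-label ones apart. embeddings: (N, D), labels: (N,)."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                        # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                             # anchors with >=1 positive
    summed = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return -(summed[valid] / pos_counts[valid]).mean()
```

Training embeddings against a loss of this shape is what tends to make near-neighbour classes such as Improved and Neutral Grassland separable before any classifier is fitted.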
Related papers
- Continental scale habitat modelling with artificial intelligence and multimodal earth observation [0.0]
Habitats integrate the abiotic conditions and biophysical structures that support biodiversity and sustain nature's contributions to people. Current maps often fall short in thematic or spatial resolution because they must model several mutually exclusive habitat types. Here, we evaluated how high-resolution remote sensing (RS) data and Artificial Intelligence (AI) tools can improve habitat classification over large geographic extents.
arXiv Detail & Related papers (2025-07-13T18:11:26Z)
- Supervised and self-supervised land-cover segmentation & classification of the Biesbosch wetlands [0.0]
This study presents a methodology for wetland land-cover segmentation and classification that adopts both supervised and self-supervised learning (SSL). We train a U-Net model from scratch on Sentinel-2 imagery across six wetland regions in the Netherlands, achieving a baseline model accuracy of 85.26%. Addressing the limited availability of labeled data, the results show that SSL pretraining with an autoencoder can improve accuracy, especially for high-resolution imagery.
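As a rough illustration of this summary's pretrain-then-fine-tune idea -- reconstruct unlabeled tiles with an autoencoder, then reuse the encoder weights to initialise a segmentation network -- here is a hedged PyTorch sketch; the band count, layer widths, and tile size are invented for the example and are not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoencoder(nn.Module):
    """Toy autoencoder for self-supervised pretraining on unlabeled tiles."""
    def __init__(self, in_ch: int = 10):               # e.g. 10 Sentinel-2 bands
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
tiles = torch.randn(4, 10, 64, 64)                     # fake unlabeled batch
loss = F.mse_loss(model(tiles), tiles)                 # reconstruction objective
# After pretraining, `model.encoder` can initialise the contracting path
# of a U-Net that is then fine-tuned on the labeled wetland data.
```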
arXiv Detail & Related papers (2025-05-27T14:42:49Z)
- Zooming In on Fakes: A Novel Dataset for Localized AI-Generated Image Detection with Forgery Amplification Approach [69.01456182499486]
BR-Gen is a large-scale dataset of 150,000 locally forged images with diverse scene-aware annotations. NFA-ViT is a Noise-guided Forgery Amplification Vision Transformer that enhances the detection of localized forgeries.
arXiv Detail & Related papers (2025-04-16T09:57:23Z)
- Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification [12.923336716880506]
We integrate image captioning and retrieval-augmented generation (RAG) with large language models (LLMs) to enhance biodiversity monitoring. Our findings highlight the potential for modern vision-language AI pipelines to support biodiversity conservation initiatives.
arXiv Detail & Related papers (2025-03-13T21:18:10Z)
- Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments [57.59857784298534]
We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes.
arXiv Detail & Related papers (2025-03-06T05:13:19Z)
- On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery [0.0]
Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor.
This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures for binary classification tasks in SSS imagery.
ViT-based models exhibit superior classification performance across F1-score, precision, recall, and accuracy metrics.
arXiv Detail & Related papers (2024-09-18T14:36:50Z)
- StrideNET: Swin Transformer for Terrain Recognition with Dynamic Roughness Extraction [0.0]
This paper presents StrideNET, a novel dual-branch architecture designed for terrain recognition and implicit properties estimation.
The implications of this work extend to various applications, including environmental monitoring, land use and land cover (LULC) classification, disaster response, and precision agriculture.
arXiv Detail & Related papers (2024-04-20T04:51:59Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
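The local-within-window, global-across-window pattern described here can be rendered as a toy PyTorch block; this is a sketch of the general windowing idea, not the HLG paper's exact architecture (the mean-pooled window summaries and all dimensions are assumptions):

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """(B, H, W, C) -> (B * num_windows, w*w, C); H and W must divide by w."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

class LocalGlobalBlock(nn.Module):
    """Toy block: attention inside each window, then attention across
    one summary token per window, broadcast back to the window tokens."""
    def __init__(self, dim: int = 96, heads: int = 3, w: int = 8):
        super().__init__()
        self.w = w
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, H, W, C)
        B, _, _, C = x.shape
        win = window_partition(x, self.w)                  # (B*nw, w*w, C)
        win, _ = self.local_attn(win, win, win)            # local attention
        summary = win.mean(dim=1).view(B, -1, C)           # (B, nw, C)
        summary, _ = self.global_attn(summary, summary, summary)
        nw = summary.size(1)
        return win + summary.reshape(B * nw, 1, C)         # inject global context
```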
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Embedding Earth: Self-supervised contrastive pre-training for dense land cover classification [61.44538721707377]
We present Embedding Earth, a self-supervised contrastive pre-training method for leveraging the large availability of satellite imagery.
We observe significant improvements of up to 25% absolute mIoU when pre-training with our proposed method.
We find that learnt features can generalize between disparate regions, opening up the possibility of using the proposed pre-training scheme.
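A standard form of such self-supervised contrastive pre-training is an NT-Xent objective over two augmented views of each unlabeled tile; the following PyTorch sketch shows the general shape of the loss and is not claimed to be this paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def ntxent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmentations of the same N tiles.
    Each embedding's positive is its sibling view; all others are negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2N, D)
    sim = z @ z.T / tau                                # (2N, 2N) logits
    sim.fill_diagonal_(float("-inf"))                  # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```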
arXiv Detail & Related papers (2022-03-11T16:14:14Z)
- Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology [0.0]
We show that ensembles of Data-efficient image Transformers (DeiTs) significantly outperform the previous state of the art (SOTA).
On all the data sets we test, we achieve a new SOTA, with a reduction of the error with respect to the previous SOTA ranging from 18.48% to 87.50%.
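In this setting, ensembling can be as simple as averaging the softmax outputs of independently trained classifiers; a minimal sketch (the model list and input are placeholders, not the paper's exact averaging rule):

```python
import torch

@torch.no_grad()
def ensemble_predict(models, x: torch.Tensor) -> torch.Tensor:
    """Average class probabilities of several classifiers (e.g. DeiTs)
    already in eval mode, and return the consensus class per image."""
    probs = torch.stack([m(x).softmax(dim=-1) for m in models])  # (M, B, K)
    return probs.mean(dim=0).argmax(dim=-1)                      # (B,)
```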
arXiv Detail & Related papers (2022-03-03T14:16:22Z)
- A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We call this dataset RIVAL10; it consists of roughly 26k instances over 10 classes.
We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds, and attributes.
In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training).
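One simple way to realise region-restricted corruptions of this kind is to add noise only under a segmentation mask, leaving the other region intact; a sketch under that assumption (the benchmark's actual corruption suite is richer than this):

```python
import torch

def corrupt_region(img: torch.Tensor, mask: torch.Tensor,
                   sigma: float = 0.1) -> torch.Tensor:
    """Add Gaussian noise only where mask == 1 (e.g. the background),
    so foreground and background sensitivity can be probed separately.
    img: (C, H, W); mask: (1, H, W) with values in {0, 1}."""
    return img + sigma * torch.randn_like(img) * mask
```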
arXiv Detail & Related papers (2022-01-26T06:31:28Z)
- Phase Consistent Ecological Domain Adaptation [76.75730500201536]
We focus on the task of semantic segmentation, where annotated synthetic data are plentiful, but annotating real data is laborious.
The first criterion, inspired by visual psychophysics, is that the map between the two image domains be phase-preserving.
The second criterion aims to leverage ecological statistics, or regularities in the scene which are manifest in any image of it, regardless of the characteristics of the illuminant or the imaging sensor.
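Taken literally, the phase-preserving criterion can be encouraged by penalising differences between the unit-magnitude Fourier spectra of a source image and its translation; the sketch below is one plausible reading, not the paper's exact loss:

```python
import torch

def phase_consistency_loss(src: torch.Tensor, translated: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """Compare unit-magnitude 2-D spectra so only the Fourier phase matters.
    src, translated: real image tensors of identical shape, e.g. (B, C, H, W)."""
    p_src = torch.fft.fft2(src)
    p_tr = torch.fft.fft2(translated)
    p_src = p_src / (p_src.abs() + eps)                # keep phase, drop magnitude
    p_tr = p_tr / (p_tr.abs() + eps)
    return (p_src - p_tr).abs().pow(2).mean()
```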
arXiv Detail & Related papers (2020-04-10T06:58:03Z)