DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data
- URL: http://arxiv.org/abs/2601.12366v1
- Date: Sun, 18 Jan 2026 11:51:09 GMT
- Title: DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data
- Authors: Jiafei Zhang, Songliang Cao, Binghui Xu, Yanan Li, Weiwei Jia, Tingting Wu, Hao Lu, Weijuan Hu, Zhiguo Han,
- Abstract summary: DepthCropSeg++ is a foundation model for crop segmentation, capable of segmenting different crop species in open in-field environments. We build upon ViT-Adapter, a state-of-the-art semantic segmentation architecture, enhance it with dynamic upsampling, and train the model with a two-stage self-training pipeline. Results demonstrate that DepthCropSeg++ achieves 93.11% mIoU on a comprehensive testing set, outperforming both supervised baselines and general-purpose vision foundation models.
- Score: 8.868203469534269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DepthCropSeg++ is a foundation model for crop segmentation, capable of segmenting different crop species in open in-field environments. Crop segmentation is a fundamental task for modern agriculture, closely related to many downstream tasks such as plant phenotyping, density estimation, and weed control. In the era of foundation models, a number of generic large language and vision models have been developed. These models have demonstrated remarkable real-world generalization thanks to significant model capacity and large-scale datasets. However, current crop segmentation models mostly learn from limited data because pixel-level labelling is expensive, and they often perform well only on specific crop types or in controlled environments. In this work, we follow the vein of our previous work DepthCropSeg, an almost unsupervised approach to crop segmentation, to scale up a cross-species and cross-scene crop segmentation dataset, with 28,406 images across 30+ species and 15 environmental conditions. We also build upon ViT-Adapter, a state-of-the-art semantic segmentation architecture, enhance it with dynamic upsampling for improved detail awareness, and train the model with a two-stage self-training pipeline. To systematically validate model performance, we conduct comprehensive experiments to justify the effectiveness and generalization capabilities across multiple crop datasets. Results demonstrate that DepthCropSeg++ achieves 93.11% mIoU on a comprehensive testing set, outperforming both supervised baselines and general-purpose vision foundation models like the Segment Anything Model (SAM) by significant margins (+0.36% and +48.57% respectively). The model particularly excels in challenging scenarios including night-time environments (86.90% mIoU), high-density canopies (90.09% mIoU), and unseen crop varieties (90.09% mIoU), indicating a new state of the art for crop segmentation.
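The results above are reported as mIoU (mean intersection-over-union). As a rough illustration of what that metric measures, here is a minimal pure-Python sketch for a two-class (background/crop) mask. The function name `mean_iou`, the class encoding (0 = background, 1 = crop), and the choice to skip classes absent from both masks are assumptions made for illustration; the paper's actual evaluation protocol may differ.

```python
def mean_iou(pred, target, num_classes=2):
    """Average per-class IoU between two label masks (lists of rows).

    Hypothetical helper: classes that appear in neither mask are skipped
    rather than counted as 0 or 1, which is one common convention.
    """
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                p_is_c = (p == c)
                t_is_c = (t == c)
                inter += p_is_c and t_is_c  # pixel is class c in both masks
                union += p_is_c or t_is_c   # pixel is class c in either mask
        if union:
            ious.append(inter / union)
    if not ious:
        return float("nan")  # no class present at all (e.g. empty masks)
    return sum(ious) / len(ious)


# Toy 2x3 masks: 0 = background, 1 = crop (assumed encoding).
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = mean_iou(pred, target)
print(f"mIoU = {score:.2%}")
```

In this toy case each class has 2 overlapping pixels out of a union of 4, so both per-class IoUs are 0.5 and so is their mean.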
Related papers
- From Semantic To Instance: A Semi-Self-Supervised Learning Approach [6.092973123903838]
We propose a semi-self-supervised learning approach that requires minimal manual annotation to develop a high-performing instance segmentation model. We use GLMask, an image-mask representation for the model to focus on shape, texture, and pattern while minimizing its dependence on color features. The proposed approach substantially outperforms the conventional instance segmentation models, establishing a state-of-the-art wheat head instance segmentation model with mAP@50 of 98.5%.
arXiv Detail & Related papers (2025-06-19T19:38:01Z) - RemoteSAM: Towards Segment Anything for Earth Observation [29.707796048411705]
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets. We present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks.
arXiv Detail & Related papers (2025-05-23T15:27:57Z) - SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z) - Enhancing Ecological Monitoring with Multi-Objective Optimization: A Novel Dataset and Methodology for Segmentation Algorithms [17.802456388479616]
We introduce a unique semantic segmentation dataset of 6,096 high-resolution aerial images capturing indigenous and invasive grass species in Bega Valley, New South Wales, Australia.
This dataset presents a challenging task due to the overlap and distribution of grass species.
The dataset and code will be made publicly available, aiming to drive research in computer vision, machine learning, and ecological studies.
arXiv Detail & Related papers (2024-07-25T18:27:27Z) - Concept Drift and Long-Tailed Distribution in Fine-Grained Visual Categorization: Benchmark and Method [84.68818879525568]
We present a Concept Drift and Long-Tailed Distribution dataset.
The characteristics of instances tend to vary with time and exhibit a long-tailed distribution.
We propose a feature recombination framework to address the learning challenges associated with CDLT.
arXiv Detail & Related papers (2023-06-04T12:42:45Z) - Agave crop segmentation and maturity classification with deep learning data-centric strategies using very high-resolution satellite imagery [101.18253437732933]
We present an Agave tequilana Weber azul crop segmentation and maturity classification using very high resolution satellite imagery.
We solve real-world deep learning problems in the very specific context of agave crop segmentation.
With the resulting accurate models, agave production forecasting can be made available for large regions.
arXiv Detail & Related papers (2023-03-21T03:15:29Z) - Advancing Plain Vision Transformer Towards Remote Sensing Foundation
Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - Multi-Spectral Image Synthesis for Crop/Weed Segmentation in Precision Farming [3.4788711710826083]
We propose an alternative solution with respect to the common data augmentation methods, applying it to the problem of crop/weed segmentation in precision farming.
We create semi-artificial samples by replacing the most relevant object classes (i.e., crop and weeds) with their synthesized counterparts.
In addition to RGB data, we take into account also near-infrared (NIR) information, generating four channel multi-spectral synthetic images.
arXiv Detail & Related papers (2020-09-12T08:49:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.