Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments
- URL: http://arxiv.org/abs/2312.07935v4
- Date: Thu, 02 Oct 2025 19:49:07 GMT
- Title: Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments
- Authors: Ranjan Sapkota, Dawood Ahmed, Manoj Karkee
- Abstract summary: This study compares the one-stage YOLOv8 model with the two-stage Mask R-CNN model for instance segmentation. Results showed YOLOv8 outperformed Mask R-CNN with higher precision and near-perfect recall at a confidence threshold of 0.5.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instance segmentation is an important image processing operation for agricultural automation, providing precise delineation of individual objects within images and enabling tasks such as selective harvesting and precision pruning. This study compares the one-stage YOLOv8 model with the two-stage Mask R-CNN model for instance segmentation under varying orchard conditions across two datasets. Dataset 1, collected in the dormant season, contains images of apple trees without foliage and was used to train multi-object segmentation models delineating branches and trunks. Dataset 2, collected in the early growing season, includes canopy images with green foliage and immature apples and was used to train single-object segmentation models delineating fruitlets. Results showed YOLOv8 outperformed Mask R-CNN with higher precision and near-perfect recall at a confidence threshold of 0.5. For Dataset 1, YOLOv8 achieved precision 0.90 and recall 0.95 compared to 0.81 and 0.81 for Mask R-CNN. For Dataset 2, YOLOv8 reached precision 0.93 and recall 0.97 compared to 0.85 and 0.88. Inference times were also lower for YOLOv8, at 10.9 ms and 7.8 ms, versus 15.6 ms and 12.8 ms for Mask R-CNN. These findings demonstrate superior accuracy and efficiency of YOLOv8 for real-time orchard automation tasks such as robotic harvesting and fruit thinning.
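For context on how the reported figures relate, precision and recall at a fixed confidence threshold can be computed from the matched detections. The sketch below is illustrative only; the function name and input format are assumptions, not evaluation code from the paper.

```python
def precision_recall(detections, num_ground_truth, conf_threshold=0.5):
    """Precision and recall at a fixed confidence threshold.

    detections: list of (confidence, is_true_positive) pairs, where
        is_true_positive marks whether the detection matched a
        ground-truth object (e.g. by an IoU criterion).
    num_ground_truth: total number of annotated objects.
    """
    # Keep only detections at or above the confidence threshold.
    kept = [is_tp for conf, is_tp in detections if conf >= conf_threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    return precision, recall
```

Raising the threshold typically trades recall for precision, which is why both papers report results at a stated threshold (0.5 here).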
Related papers
- BloomNet: Exploring Single vs. Multiple Object Annotation for Flower Recognition Using YOLO Variants [0.0]
This paper benchmarks several YOLO architectures such as YOLOv5s, YOLOv8n/s/m, and YOLOv12n for object detection under two annotation regimes. The FloralSix dataset, comprising 2,816 high-resolution photos of six different flower species, is also introduced.
arXiv Detail & Related papers (2026-02-20T19:47:45Z) - NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation [47.32364120562497]
Novel Object Cyclic Threshold based Instance (NOCTIS) is a framework for designing a model general enough to be employed for novel objects. We show that NOCTIS outperforms the best RGB and RGB-D methods on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
arXiv Detail & Related papers (2025-07-02T08:23:14Z) - RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity [0.8488322025656239]
This study conducts a detailed comparison of RF-DETR object detection base model and YOLOv12 object detection model configurations.
A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations.
The RF-DETR model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling.
YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment.
arXiv Detail & Related papers (2025-04-17T17:08:11Z) - Remote Sensing Image Classification Using Convolutional Neural Network (CNN) and Transfer Learning Techniques [1.024113475677323]
This study investigates the classification of aerial images depicting transmission towers, forests, farmland, and mountains.
Features are extracted from the input images using a Convolutional Neural Network (CNN) architecture.
Our study shows that transfer learning models, and MobileNetV2 in particular, work well for landscape categorization.
arXiv Detail & Related papers (2025-03-04T11:19:18Z) - Assessing the Capability of YOLO- and Transformer-based Object Detectors for Real-time Weed Detection [0.0]
All available models of YOLOv8, YOLOv9, YOLOv10, and RT-DETR are trained and evaluated with images from a real field situation.
The results demonstrate that while all models perform equally well in the metrics evaluated, the YOLOv9 models stand out in terms of their strong recall scores.
RT-DETR models, especially RT-DETR-l, excel in precision, reaching 82.44% on dataset 1 and 81.46% on dataset 2.
arXiv Detail & Related papers (2025-01-29T02:39:57Z) - Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development [0.36868085124383626]
This study presents a novel method for deep learning-based instance segmentation of apples in commercial orchards.
We synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model.
The results showed that the automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations.
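The Dice Coefficient and IoU cited above are standard overlap measures for comparing a predicted mask against a reference mask. A minimal sketch, assuming masks are given as sets of pixel coordinates (the function name and representation are illustrative):

```python
def dice_and_iou(mask_a, mask_b):
    """Dice coefficient and IoU for two binary masks.

    mask_a, mask_b: iterables of (row, col) pixel coordinates
    belonging to each mask.
    """
    a, b = set(mask_a), set(mask_b)
    inter = len(a & b)   # pixels in both masks
    union = len(a | b)   # pixels in either mask
    dice = 2 * inter / (len(a) + len(b)) if (a or b) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

Dice weights the intersection twice against the mask sizes, so for the same pair of masks it is always at least as large as IoU, consistent with the 0.9513 vs. 0.9303 figures above.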
arXiv Detail & Related papers (2024-11-18T05:11:29Z) - Comparing YOLOv11 and YOLOv8 for instance segmentation of occluded and non-occluded immature green fruits in complex orchard environment [0.4143603294943439]
YOLO11n-seg achieved the highest mask precision across all categories with a notable score of 0.831. YOLO11m-seg and YOLO11l-seg excelled in non-occluded and occluded fruitlet segmentation. YOLO11m-seg consistently outperformed, registering the highest scores for both box and mask segmentation.
arXiv Detail & Related papers (2024-10-24T00:12:20Z) - Comparison of Machine Learning Approaches for Classifying Spinodal Events [3.030969076856776]
We evaluate state-of-the-art models (MobileViT, NAT, EfficientNet, CNN) alongside several ensemble models (majority voting, AdaBoost).
Our findings show that NAT and MobileViT outperform other models, achieving the highest accuracy, AUC, and F1 score on both training and testing data.
arXiv Detail & Related papers (2024-10-13T07:27:00Z) - Performance Evaluation of YOLOv8 Model Configurations, for Instance Segmentation of Strawberry Fruit Development Stages in an Open Field Environment [0.0]
This study evaluates the performance of YOLOv8 model configurations for instance segmentation of strawberries into ripe and unripe stages in an open field environment.
The YOLOv8n model demonstrated superior segmentation accuracy with a mean Average Precision (mAP) of 80.9%, outperforming other YOLOv8 configurations.
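The mAP reported here averages, over classes, the area under each class's precision-recall curve. A simplified, uninterpolated sketch of per-class average precision (real mAP implementations such as COCO's use interpolated precision and multiple IoU thresholds; the function name and input format below are illustrative):

```python
def average_precision(detections, num_ground_truth):
    """Uninterpolated average precision for one class.

    detections: (confidence, is_true_positive) pairs.
    num_ground_truth: number of annotated objects of this class.
    """
    # Sweep the confidence threshold from high to low.
    dets = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each detection is admitted
    for conf, is_tp in dets:
        tp += is_tp
        fp += not is_tp
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    # Area under the PR curve: sum precision over recall increments.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

mAP is then the mean of this quantity across classes (ripe and unripe, in this paper's setting).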
arXiv Detail & Related papers (2024-08-11T00:33:45Z) - EffiSegNet: Gastrointestinal Polyp Segmentation through a Pre-Trained EfficientNet-based Network with a Simplified Decoder [0.8892527836401773]
This work introduces EffiSegNet, a novel segmentation framework leveraging transfer learning with a pre-trained Convolutional Neural Network (CNN) as its backbone.
We evaluate our model on the gastrointestinal polyp segmentation task using the publicly available Kvasir-SEG dataset, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-07-23T08:54:55Z) - MIMIC: Masked Image Modeling with Image Correspondences [29.8154890262928]
Current methods for building effective pretraining datasets rely on annotated 3D meshes, point clouds, and camera parameters from simulated environments.
We propose a pretraining dataset-curation approach that does not require any additional annotations.
Our method allows us to generate multi-view datasets from both real-world videos and simulated environments at scale.
arXiv Detail & Related papers (2023-06-27T00:40:12Z) - Exploring the Effectiveness of Dataset Synthesis: An application of Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data is slightly underperforming compared to a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z) - Facilitated machine learning for image-based fruit quality assessment in developing countries [68.8204255655161]
Automated image classification is a common task for supervised machine learning in food science.
We propose an alternative method based on pre-trained vision transformers (ViTs).
It can be easily implemented with limited resources on a standard device.
arXiv Detail & Related papers (2022-07-10T19:52:20Z) - EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z) - Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs, and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the generalization and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing.
arXiv Detail & Related papers (2021-06-08T06:13:11Z) - Contemplating real-world object classification [53.10151901863263]
We reanalyze the ObjectNet dataset recently proposed by Barbu et al. containing objects in daily life situations.
We find that applying deep models to the isolated objects, rather than the entire scene as is done in the original paper, results in around 20-30% performance improvement.
arXiv Detail & Related papers (2021-03-08T23:29:59Z) - A CNN Approach to Simultaneously Count Plants and Detect Plantation-Rows from UAV Imagery [56.10033255997329]
We propose a novel deep learning method based on a Convolutional Neural Network (CNN).
It simultaneously detects and geolocates plantation-rows while counting their plants, considering highly dense plantation configurations.
The proposed method achieved state-of-the-art performance for counting and geolocating plants and plant-rows in UAV images from different types of crops.
arXiv Detail & Related papers (2020-12-31T18:51:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.