Related papers: Vision transformer-based multi-camera multi-object tracking framework for dairy cow monitoring

URL: http://arxiv.org/abs/2508.01752v1
Date: Sun, 03 Aug 2025 13:36:40 GMT
Title: Vision transformer-based multi-camera multi-object tracking framework for dairy cow monitoring
Authors: Kumail Abbas, Zeeshan Afzal, Aqeel Raza, Taha Mansouri, Andrew W. Dowsey, Chaidate Inchaisri, Ali Alameer
Abstract summary: This study developed a unique multi-camera, real-time tracking system for indoor-housed Holstein Friesian dairy cows. This technology uses cutting-edge computer vision techniques, including instance segmentation and tracking algorithms to monitor cow activity seamlessly and accurately.
Score: 0.06282171844772422
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Activity and behaviour correlate with dairy cow health and welfare, making continual and accurate monitoring crucial for disease identification and farm productivity. Manual observation and frequent assessments are laborious and inconsistent for activity monitoring. In this study, we developed a unique multi-camera, real-time tracking system for indoor-housed Holstein Friesian dairy cows. This technology uses cutting-edge computer vision techniques, including instance segmentation and tracking algorithms to monitor cow activity seamlessly and accurately. An integrated top-down barn panorama was created by geometrically aligning six camera feeds using homographic transformations. The detection phase used a refined YOLO11-m model trained on an overhead cow dataset, obtaining high accuracy (mAP@0.50 = 0.97, F1 = 0.95). SAMURAI, an upgraded Segment Anything Model 2.1, generated pixel-precise cow masks for instance segmentation utilizing zero-shot learning and motion-aware memory. Even with occlusion and fluctuating posture, a motion-aware Linear Kalman filter and IoU-based data association reliably identified cows over time for object tracking. The proposed system significantly outperformed Deep SORT Realtime. Multi-Object Tracking Accuracy (MOTA) was 98.7% and 99.3% in two benchmark video sequences, with IDF1 scores above 99% and near-zero identity switches. This unified multi-camera system can track dairy cows in complex interior surroundings in real time, according to our data. The system reduces redundant detections across overlapping cameras and maintains continuity as cows move between viewpoints, with the aim of improving early sickness prediction through activity quantification and behavioural classification.
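
Illustrative sketch (not from the paper): the abstract mentions IoU-based data association and reports MOTA, so the Python below shows one simple way those two pieces could look. It is a sketch under stated assumptions, not the authors' implementation. The greedy matching strategy, the 0.3 IoU threshold, and the names `iou`, `associate`, and `mota` are all hypothetical; the paper's actual matching procedure is not described in this listing, and the Kalman prediction step is omitted (track boxes are assumed to be already-predicted positions). The MOTA function uses the standard CLEAR MOT definition, 1 - (FN + FP + IDSW) / GT.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box], detections: List[Box], min_iou: float = 0.3
) -> Dict[int, int]:
    """Greedy IoU matching (an assumed strategy): track id -> detection index."""
    # Score every track/detection pair, best overlap first.
    pairs = sorted(
        (
            (iou(track_box, det_box), track_id, det_idx)
            for track_id, track_box in tracks.items()
            for det_idx, det_box in enumerate(detections)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_detections = set()
    for score, track_id, det_idx in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same animal
        if track_id in matches or det_idx in used_detections:
            continue  # each track and each detection is matched at most once
        matches[track_id] = det_idx
        used_detections.add(det_idx)
    return matches


def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    """Standard CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


if __name__ == "__main__":
    # Two existing tracks and two new detections, listed in swapped order.
    tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
    detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
    print(associate(tracks, detections))
    print(mota(false_negatives=1, false_positives=0,
               id_switches=0, num_ground_truth=100))
```

Greedy matching is used here only because it is short; an optimal assignment (for example, the Hungarian algorithm) is a common alternative when many boxes overlap.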

Related papers

A multi-head deep fusion model for recognition of cattle foraging events using sound and movement signals [0.2450783418670958]
This work introduces a deep neural network based on the fusion of acoustic and inertial signals. The main advantage of this model is the combination of signals through the automatic extraction of features independently from each of them.
arXiv Detail & Related papers (2025-05-15T11:55:16Z)
Consistent multi-animal pose estimation in cattle using dynamic Kalman filter based tracking [0.0]
KeySORT is an adaptive Kalman filter to construct tracklets in a bounding-box free manner, significantly improving the temporal consistency of detected keypoints. Our test results indicate our algorithm is able to detect up to 80% of the ground truth keypoints with high accuracy.
arXiv Detail & Related papers (2025-03-13T15:15:54Z)
Holstein-Friesian Re-Identification using Multiple Cameras and Self-Supervision on a Working Farm [2.9391768712283772]
We present MultiCamCows2024, a farm-scale image dataset filmed across multiple cameras for the biometric identification of individual Holstein-Friesian cattle. The dataset comprises 101,329 images of 90 cows, plus underlying original CCTV footage. We report a performance above 96% single image identification accuracy from the dataset and demonstrate that combining data from multiple cameras during learning enhances self-supervised identification.
arXiv Detail & Related papers (2024-10-16T15:58:47Z)
Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos [91.44553585470688]
Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
arXiv Detail & Related papers (2023-08-20T18:23:07Z)
Next-generation Surgical Navigation: Marker-less Multi-view 6DoF Pose Estimation of Surgical Instruments [64.59698930334012]
We present a multi-camera capture setup consisting of static and head-mounted cameras. Second, we publish a multi-view RGB-D video dataset of ex-vivo spine surgeries, captured in a surgical wet lab and a real operating theatre. Third, we evaluate three state-of-the-art single-view and multi-view methods for the task of 6DoF pose estimation of surgical instruments.
arXiv Detail & Related papers (2023-05-05T13:42:19Z)
Cut and Learn for Unsupervised Object Detection and Instance Segmentation [65.43627672225624]
Cut-and-LEaRn (CutLER) is a simple approach for training unsupervised object detection and segmentation models. CutLER is a zero-shot unsupervised detector and improves detection performance AP50 by over 2.7 times on 11 benchmarks.
arXiv Detail & Related papers (2023-01-26T18:57:13Z)
Livestock Monitoring with Transformer [4.298326853567677]
We develop an end-to-end behaviour monitoring system for group-housed pigs to perform simultaneous instance level segmentation, tracking, action recognition and re-identification tasks. We present starformer, the first end-to-end multiple-object livestock monitoring framework that learns instance-level embeddings for grouped pigs through the use of transformer architecture.
arXiv Detail & Related papers (2021-11-01T10:03:49Z)
Intra-Inter Camera Similarity for Unsupervised Person Re-Identification [50.85048976506701]
We study a novel intra-inter camera similarity for pseudo-label generation. We train our re-id model in two stages with intra-camera and inter-camera pseudo-labels, respectively. This simple intra-inter camera similarity produces surprisingly good performance on multiple datasets.
arXiv Detail & Related papers (2021-03-22T08:29:04Z)
Joint Noise-Tolerant Learning and Meta Camera Shift Adaptation for Unsupervised Person Re-Identification [60.36551512902312]
Unsupervised person re-identification (re-ID) aims to learn discriminative models with unlabeled data. One popular method is to obtain pseudo-labels by clustering and use them to optimize the model. In this paper, we propose a unified framework to solve both problems.
arXiv Detail & Related papers (2021-03-08T09:13:06Z)
Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot. It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations in multi-modal data for recognizing gestures. Results show that our approach recovers the performance with large gains, up to 12.91% in ACC and 20.16% in F1 score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
Dairy Cow rumination detection: A deep learning approach [0.8312466807725921]
Rumination behavior is a significant variable for tracking the development and yield of animal husbandry. Modern attached devices are invasive, stressful and uncomfortable for the cattle. In this study, we introduce an innovative monitoring method using Convolutional Neural Network (CNN)-based deep learning models.
arXiv Detail & Related papers (2021-01-07T07:33:32Z)
Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training. We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.