An Efficient Additive Kolmogorov-Arnold Transformer for Point-Level Maize Localization in Unmanned Aerial Vehicle Imagery
- URL: http://arxiv.org/abs/2601.07975v1
- Date: Mon, 12 Jan 2026 20:16:10 GMT
- Title: An Efficient Additive Kolmogorov-Arnold Transformer for Point-Level Maize Localization in Unmanned Aerial Vehicle Imagery
- Authors: Fei Li, Lang Qiao, Jiahao Fan, Yijia Xu, Shawn M. Kaeppler, Zhou Zhang
- Abstract summary: High-resolution UAV photogrammetry has become a key technology for precision agriculture. Point-level maize localization in UAV imagery remains challenging due to extremely small object-to-pixel ratios. We propose the Additive Kolmogorov-Arnold Transformer (AKT) to address these challenges.
- Score: 9.080987184733456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-resolution UAV photogrammetry has become a key technology for precision agriculture, enabling centimeter-level crop monitoring and point-level plant localization. However, point-level maize localization in UAV imagery remains challenging due to (1) extremely small object-to-pixel ratios, typically less than 0.1%, (2) prohibitive computational costs of quadratic attention on ultra-high-resolution images larger than 3000 x 4000 pixels, and (3) agricultural scene-specific complexities such as sparse object distribution and environmental variability that are poorly handled by general-purpose vision models. To address these challenges, we propose the Additive Kolmogorov-Arnold Transformer (AKT), which replaces conventional multilayer perceptrons with Padé Kolmogorov-Arnold Network (PKAN) modules to enhance functional expressivity for small-object feature extraction, and introduces PKAN Additive Attention (PAA) to model multiscale spatial dependencies with reduced computational complexity. In addition, we present the Point-based Maize Localization (PML) dataset, consisting of 1,928 high-resolution UAV images with approximately 501,000 point annotations collected under real field conditions. Extensive experiments show that AKT achieves an average F1-score of 62.8%, outperforming state-of-the-art methods by 4.2%, while reducing FLOPs by 12.6% and improving inference throughput by 20.7%. For downstream tasks, AKT attains a mean absolute error of 7.1 in stand counting and a root mean square error of 1.95-1.97 cm in interplant spacing estimation. These results demonstrate that integrating Kolmogorov-Arnold representation theory with efficient attention mechanisms offers an effective framework for high-resolution agricultural remote sensing.
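The two core ideas the abstract combines can be illustrated generically. Below is a minimal NumPy sketch, not the authors' implementation: a "safe" Padé rational activation of the kind used in Padé activation units (the building block a PKAN layer would learn per edge), and a Fastformer-style additive attention step whose cost is linear in the number of tokens. All names, polynomial degrees, and the weight-sharing shortcut are illustrative assumptions.

```python
import numpy as np

def pade_activation(x, a, b):
    """Safe Pade rational activation: P_m(x) / (1 + |Q_n(x)|).

    a: numerator coefficients, low degree first (a0 + a1*x + ...)
    b: denominator coefficients of x, x^2, ... (constant term fixed to 1)
    The absolute value in the denominator keeps it pole-free.
    """
    num = np.polyval(a[::-1], x)  # polyval expects highest degree first
    den = 1.0 + np.abs(np.polyval(np.concatenate(([0.0], b))[::-1], x))
    return num / den

def additive_attention(X, wq, wk):
    """Additive attention sketch: O(N*d) instead of O(N^2*d).

    X: (N, d) token features; wq, wk: (d,) learned scoring vectors.
    Tokens are summarized into global query/key vectors via scalar
    scores, so no N x N attention matrix is ever formed.
    """
    alpha = np.exp(X @ wq)
    alpha /= alpha.sum()            # softmax over tokens, shape (N,)
    q_global = alpha @ X            # (d,) global query summary
    p = X * q_global                # element-wise query-token interaction
    beta = np.exp(p @ wk)
    beta /= beta.sum()
    k_global = beta @ p             # (d,) global key summary
    return X * k_global             # (N, d) output
```

With identity-like coefficients (`a = [0, 1]`, empty `b`) the Padé unit reduces to `f(x) = x`; training would instead fit `a` and `b` so each edge learns its own nonlinearity, which is the functional-expressivity argument the abstract makes for small objects.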
Related papers
- A Domain-Adapted Lightweight Ensemble for Resource-Efficient Few-Shot Plant Disease Classification [0.0]
We present a few-shot learning approach that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms. It consistently improved performance across 1- to 15-shot scenarios, reaching 98.23±0.33% at 15 shots. Notably, it also outperformed the previous SOTA accuracy of 96.4% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning.
arXiv Detail & Related papers (2025-12-15T15:17:29Z) - Evaluating the Efficacy of Sentinel-2 versus Aerial Imagery in Serrated Tussock Classification [1.7975159705384043]
Serrated tussock (*Nassella trichotoma*) is a highly competitive invasive grass species that disrupts native grasslands, reduces pasture productivity, and increases land management costs. Current ground surveys and subsequent management practices are effective at small scales, but they are not feasible for landscape-scale monitoring. Satellite-based remote sensing provides a more cost-effective and scalable alternative, though often with lower spatial resolution.
arXiv Detail & Related papers (2025-12-12T04:10:44Z) - YOLOv11-Litchi: Efficient Litchi Fruit Detection based on UAV-Captured Agricultural Imagery in Complex Orchard Environments [6.862722449907841]
This paper introduces YOLOv11-Litchi, a lightweight and robust detection model specifically designed for UAV-based litchi detection. YOLOv11-Litchi achieves a parameter size of 6.35 MB, 32.5% smaller than the YOLOv11 baseline. The model achieves a frame rate of 57.2 FPS, meeting real-time detection requirements.
arXiv Detail & Related papers (2025-10-11T09:44:00Z) - An Improved YOLOv8 Approach for Small Target Detection of Rice Spikelet Flowering in Field Environments [1.0288898584996287]
This study proposes a rice spikelet flowering recognition method based on an improved YOLOv8 object detection model. BiFPN replaces the original PANet structure to enhance feature fusion and improve multi-scale feature utilization. Given the lack of publicly available datasets for rice spikelet flowering in field conditions, a high-resolution RGB camera and data augmentation techniques are used.
arXiv Detail & Related papers (2025-07-28T04:01:29Z) - Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification [9.706130801069143]
Land cover classification identifies land cover types in sub-meter remote sensing imagery. Most existing methods focus on 1 m imagery and rely heavily on large-scale annotations. We introduce Baltimore Atlas, a generalizable land cover classification framework that reduces reliance on large-scale training data.
arXiv Detail & Related papers (2025-06-18T15:41:29Z) - Data Augmentation and Resolution Enhancement using GANs and Diffusion Models for Tree Segmentation [49.13393683126712]
Urban forests play a key role in enhancing environmental quality and supporting biodiversity in cities. Accurately detecting trees is challenging due to complex landscapes and the variability in image resolution caused by different satellite sensors or UAV flight altitudes. We propose a novel pipeline that integrates domain adaptation with GANs and Diffusion models to enhance the quality of low-resolution aerial images.
arXiv Detail & Related papers (2025-05-21T03:57:10Z) - ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling [15.369357830312914]
ORBIT-2 is a scalable foundation model for global, hyper-resolution climate downscaling. It incorporates Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction. TILES is a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear.
arXiv Detail & Related papers (2025-05-07T21:09:00Z) - Recognize Any Regions [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model. Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z) - DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z) - EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm [111.17100512647619]
This paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA).
We propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block.
Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach.
arXiv Detail & Related papers (2022-06-19T04:49:35Z) - Focal Self-attention for Local-Global Interactions in Vision
Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z) - Estimating Crop Primary Productivity with Sentinel-2 and Landsat 8 using Machine Learning Methods Trained with Radiative Transfer Simulations [58.17039841385472]
We take advantage of all parallel developments in mechanistic modeling and satellite data availability for advanced monitoring of crop productivity.
Our model successfully estimates gross primary productivity across a variety of C3 crop types and environmental conditions even though it does not use any local information from the corresponding sites.
This highlights its potential to map crop productivity from new satellite sensors at a global scale with the help of current Earth observation cloud computing platforms.
arXiv Detail & Related papers (2020-12-07T16:23:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.