Lightweight Vision Transformer with Window and Spatial Attention for Food Image Classification
- URL: http://arxiv.org/abs/2509.18692v1
- Date: Tue, 23 Sep 2025 06:23:50 GMT
- Title: Lightweight Vision Transformer with Window and Spatial Attention for Food Image Classification
- Authors: Xinle Gao, Linghui Ye, Zhiyong Xiao
- Abstract summary: We propose a lightweight food image classification algorithm that integrates a Window Multi-Head Attention Mechanism (WMHAM) and a Spatial Attention Mechanism (SAM). On the Food-101 and Vireo Food-172 datasets, our model achieves accuracies of 95.24% and 94.33%, respectively, while significantly reducing parameters and FLOPs compared with baseline methods.
- Score: 1.1472801896854488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of society and continuous advances in science and technology, the food industry increasingly demands higher production quality and efficiency. Food image classification plays a vital role in enabling automated quality control on production lines, supporting food safety supervision, and promoting intelligent agricultural production. However, this task faces challenges due to the large number of parameters and high computational complexity of Vision Transformer models. To address these issues, we propose a lightweight food image classification algorithm that integrates a Window Multi-Head Attention Mechanism (WMHAM) and a Spatial Attention Mechanism (SAM). The WMHAM reduces computational cost by capturing local and global contextual features through efficient window partitioning, while the SAM adaptively emphasizes key spatial regions to improve discriminative feature representation. Experiments conducted on the Food-101 and Vireo Food-172 datasets demonstrate that our model achieves accuracies of 95.24% and 94.33%, respectively, while significantly reducing parameters and FLOPs compared with baseline methods. These results confirm that the proposed approach achieves an effective balance between computational efficiency and classification performance, making it well-suited for deployment in resource-constrained environments.
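The abstract attributes the cost reduction to window partitioning. The following is a minimal Python sketch of that general idea, not the authors' WMHAM implementation: it assumes non-overlapping square windows (as in Swin-style models), and the function names `window_partition` and `attention_pairs` are hypothetical. It splits a token grid into windows and compares how many query-key pairs global attention and per-window attention would have to score. The Spatial Attention Mechanism is not modeled here.

```python
def window_partition(grid, window):
    """Split an H x W grid (list of rows) into non-overlapping window x window tiles.

    Returns the tiles in row-major order; each tile is a flat list of tokens.
    Assumption: H and W are divisible by the window size.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([grid[top + dy][left + dx]
                          for dy in range(window)
                          for dx in range(window)])
    return tiles


def attention_pairs(height, width, window):
    """Count query-key pairs for global attention vs. attention within each window."""
    tokens = height * width
    global_pairs = tokens ** 2                      # every token attends to every token
    num_windows = (height // window) * (width // window)
    windowed_pairs = num_windows * (window * window) ** 2  # tokens attend only inside their window
    return global_pairs, windowed_pairs


if __name__ == "__main__":
    # A 4 x 4 grid of token ids, split into four 2 x 2 windows.
    grid = [[row * 4 + col for col in range(4)] for row in range(4)]
    print(window_partition(grid, 2))
    # A 56 x 56 token grid with 7 x 7 windows (illustrative sizes, not from the paper).
    print(attention_pairs(56, 56, 7))
```

For the illustrative 56 x 56 grid with 7 x 7 windows, the pair count drops from 9,834,496 to 153,664, a factor of 64. The windowed cost grows linearly with the number of tokens rather than quadratically.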
Related papers
- Flexible Manufacturing Systems Intralogistics: Dynamic Optimization of AGVs and Tool Sharing Using Coloured-Timed Petri Nets and Actor-Critic RL with Actions Masking [0.0]
This paper advances the traditional job shop scheduling problem by incorporating additional complexities through the simultaneous integration of automated guided vehicles (AGVs) and tool-sharing systems. We propose a novel approach that combines Colored-Timed Petri Nets (CTPNs) with actor-critic model-based reinforcement learning (MBRL). Our approach was evaluated on small-sized public benchmarks and a newly developed large-scale benchmark inspired by the Taillard benchmark.
arXiv Detail & Related papers (2026-01-08T12:37:02Z)
- A Multi-objective Optimization Approach for Feature Selection in Gentelligent Systems [62.08647860272078]
This paper uses the term "Gentelligent system" to refer to systems that incorporate inherent component information and automated mechanisms. By implementing reliable fault detection methods, manufacturers can achieve several benefits, including improved product quality, increased yield, and reduced production costs.
arXiv Detail & Related papers (2025-11-20T23:50:55Z)
- AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock [77.95897723270453]
Crops, fisheries and livestock form the backbone of global food production, essential to feed the ever-growing global population. Addressing these issues requires efficient, accurate, and scalable technological solutions, highlighting the importance of artificial intelligence (AI). This survey presents a systematic and thorough review of more than 200 research works covering conventional machine learning approaches, advanced deep learning techniques, and recent vision-language foundation models.
arXiv Detail & Related papers (2025-07-29T17:59:48Z)
- Swin-TUNA: A Novel PEFT Approach for Accurate Food Image Segmentation [3.061662434597098]
This paper introduces the TUNable Adapter module (Swin-TUNA), a Parameter-Efficient Fine-Tuning (PEFT) method that integrates multiscale trainable adapters into the Swin Transformer architecture. Experiments demonstrate that this method achieves mIoU of 50.56% and 74.94% on the FoodSeg103 and UECFoodPix Complete datasets.
arXiv Detail & Related papers (2025-07-23T09:28:25Z)
- Dual Atrous Separable Convolution for Improving Agricultural Semantic Segmentation [2.3636539018632616]
This study proposes an efficient image segmentation method for precision agriculture. A novel Dual Atrous Separable Convolution (DAS Conv) module is integrated within the DeepLabV3-based segmentation framework. It achieves more than 66% improvement in efficiency when considering the trade-off between model complexity and performance.
arXiv Detail & Related papers (2025-06-27T18:37:43Z)
- Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention [54.42902794496325]
Linear attention, a variant of softmax attention, demonstrates promise in global context modeling. We propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer.
arXiv Detail & Related papers (2025-05-22T02:57:23Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
- RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models [96.43285670458803]
Uni-Food is a unified food dataset that comprises over 100,000 images with various food labels. Uni-Food is designed to provide a more holistic approach to food data analysis. We introduce a novel Linear Rectification Mixture of Diverse Experts (RoDE) approach to address the inherent challenges of food-related multitasking.
arXiv Detail & Related papers (2024-07-17T16:49:34Z)
- Computer Vision in the Food Industry: Accurate, Real-time, and Automatic Food Recognition with Pretrained MobileNetV2 [1.6590638305972631]
This study employs the pretrained MobileNetV2 model, which is efficient and fast, for food recognition on the public Food11 dataset, comprising 16643 images.
It also utilizes various techniques such as dataset understanding, transfer learning, data augmentation, regularization, dynamic learning rate, hyper parameter tuning, and consideration of images in different sizes to enhance performance and robustness.
Despite employing a light model with a simpler structure and fewer trainable parameters compared to some deep and dense models in the deep learning area, it achieved commendable accuracy in a short time.
arXiv Detail & Related papers (2024-05-19T17:20:20Z)
- Interpreting and Improving Attention From the Perspective of Large Kernel Convolution [51.06461246235176]
We introduce Large Kernel Convolutional Attention (LKCA), a novel formulation that reinterprets attention operations as a single large-kernel convolution. LKCA achieves competitive performance across various visual tasks, particularly in data-constrained settings.
arXiv Detail & Related papers (2024-01-11T08:40:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.