Related papers: MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection

URL: http://arxiv.org/abs/2505.02441v1
Date: Mon, 05 May 2025 08:10:22 GMT
Title: MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection
Authors: Jiaqi Zhang, Zhuodong Liu, Kejian Yu,
Abstract summary: Accurate identification of agricultural pests is essential for crop protection. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features.
Score: 3.5148549831413036
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accurate identification of agricultural pests is essential for crop protection but remains challenging due to the large intra-class variance and fine-grained differences among pest species. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features and lack effective multi-modal integration, leading to limited accuracy and poor interpretability. Moreover, the scarcity of high-quality multi-modal agricultural datasets further restricts progress in this field. To address these issues, we construct two novel multi-modal benchmarks, CTIP102 and STIP102, based on the widely used IP102 dataset, and introduce a Multi-scale Cross-Modal Fusion Network (MSFNet-CPD) for robust pest detection. Our approach enhances visual quality via a super-resolution reconstruction module, and feeds both the original and reconstructed images into the network to improve clarity and detection performance. To better exploit semantic cues, we propose an Image-Text Fusion (ITF) module for joint modeling of visual and textual features, and an Image-Text Converter (ITC) that reconstructs fine-grained details across multiple scales to handle challenging backgrounds. Furthermore, we introduce an Arbitrary Combination Image Enhancement (ACIE) strategy to generate a more complex and diverse pest detection dataset, MTIP102, improving the model's generalization to real-world scenarios. Extensive experiments demonstrate that MSFNet-CPD consistently outperforms state-of-the-art methods on multiple pest detection benchmarks. All code and datasets will be made publicly available at: https://github.com/Healer-ML/MSFNet-CPD.

Related papers

HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection [4.908389661988192]
HFMF is a comprehensive two-stage deepfake detection framework. It integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks.
arXiv Detail & Related papers (2025-01-10T00:20:29Z)
MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection [64.29452783056253]
The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored. We propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities.
arXiv Detail & Related papers (2024-09-15T13:08:59Z)
Cross-Domain Separable Translation Network for Multimodal Image Change Detection [11.25422609271201]
Multimodal change detection (MCD) is particularly critical in the remote sensing community. This paper focuses on addressing the challenges of MCD, especially the difficulty in comparing images from different sensors. A novel unsupervised cross-domain separable translation network (CSTN) is proposed to overcome these limitations.
arXiv Detail & Related papers (2024-07-23T03:56:02Z)
Robust CLIP-Based Detector for Exposing Diffusion Model-Generated Images [13.089550724738436]
Diffusion models (DMs) have revolutionized image generation, producing high-quality images with applications spanning various fields. Their ability to create hyper-realistic images poses significant challenges in distinguishing between real and synthetic content. This work introduces a robust detection framework that integrates image and text features extracted by CLIP model with a Multilayer Perceptron (MLP) classifier.
arXiv Detail & Related papers (2024-04-19T14:30:41Z)
A Multimodal Approach for Advanced Pest Detection and Classification [0.9003384937161055]
This paper presents a novel multimodal deep learning framework for enhanced agricultural pest detection. It combines tiny-BERT's natural language processing with R-CNN and ResNet-18's image processing.
arXiv Detail & Related papers (2023-12-18T05:54:20Z)
DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection [55.48770333927732]
We propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection. It consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor. Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-12-11T18:38:28Z)
Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-free Multi-Exposure Image Fusion [60.221404321514086]
Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels. This paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for automatic design of both network structures and loss functions.
arXiv Detail & Related papers (2023-09-03T08:07:26Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
Learning from Multi-Perception Features for Real-Word Image Super-resolution [87.71135803794519]
We propose a novel SR method called MPF-Net that leverages multiple perceptual features of input images. Our method incorporates a Multi-Perception Feature Extraction (MPFE) module to extract diverse perceptual information. We also introduce a contrastive regularization term (CR) that improves the model's learning capability.
arXiv Detail & Related papers (2023-05-26T07:35:49Z)
TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis [6.899641625551976]
We introduce a pure transformer-based method, which we refer to as Throughout Fusion Transformer (TFormer), for sufficient information integration in multi-modal skin lesion diagnosis (MSLD). We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across different image modalities in a stage-by-stage way. Our TFormer outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2022-11-21T12:07:05Z)
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [74.19291916812921]
Forged images generated by Deepfake techniques pose a serious threat to the trustworthiness of digital information. In this paper, we aim to capture the subtle manipulation artifacts at different scales for Deepfake detection. We introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods.
arXiv Detail & Related papers (2021-04-20T05:43:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.