AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection
- URL: http://arxiv.org/abs/2601.04734v1
- Date: Thu, 08 Jan 2026 08:56:07 GMT
- Title: AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection
- Authors: Yunqing Hu, Zheming Yang, Chang Zhao, Qi Guo, Meng Gao, Pengcheng Li, Wen Ji,
- Abstract summary: This paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy. To maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm.
- Score: 15.419663374345845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.
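The abstract describes a scheduler that routes work between lightweight edge detectors and a cloud MLLM under heterogeneous device and network conditions. The paper does not publish the algorithm here, but the core trade-off can be sketched as a latency comparison per device: send detected crops to the cloud when upload plus cloud inference beats local fallback processing. All names, payload sizes, and latency figures below are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class EdgeDevice:
    name: str
    detect_ms: float      # lightweight detector latency on this device
    local_cls_ms: float   # latency of a local fallback classifier
    uplink_mbps: float    # current uplink bandwidth to the cloud

def upload_ms(payload_kb: float, mbps: float) -> float:
    # payload_kb * 8 = kilobits; 1 Mbps = 1 kilobit per millisecond,
    # so kilobits / mbps gives milliseconds.
    return payload_kb * 8.0 / mbps

def plan_route(devices, crop_kb=120.0, cloud_infer_ms=400.0):
    """Per-device routing: 'cloud' if end-to-end cloud latency
    (detect + upload crops + cloud MLLM inference) is no worse than
    running everything locally, else 'edge'."""
    routes = {}
    for d in devices:
        cloud_total = d.detect_ms + upload_ms(crop_kb, d.uplink_mbps) + cloud_infer_ms
        edge_total = d.detect_ms + d.local_cls_ms
        routes[d.name] = "cloud" if cloud_total <= edge_total else "edge"
    return routes
```

For example, a device with a fast uplink would be routed to the cloud, while the same crop payload on a 0.5 Mbps link dominates the budget and keeps inference local. The actual AIVD scheduler additionally accounts for throughput across many concurrent devices, which this single-device sketch omits.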
Related papers
- Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation [48.88299242238335]
Cross-Modal Re-identification (CM-ReID) faces challenges due to a fragmented ecosystem of specialized cloud models. We propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture.
arXiv Detail & Related papers (2026-02-13T13:48:08Z) - OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - LQA: A Lightweight Quantized-Adaptive Framework for Vision-Language Models on the Edge [12.772499009055194]
We propose a lightweight, quantized-adaptive framework for Vision-Language Models (VLMs). We introduce Selective Hybrid Quantization (SHQ) and a quantized, gradient-free adaptation mechanism to enable robust and efficient VLM deployment on resource-constrained hardware. Experiments show that LQA improves overall adaptation performance by 4.5%, uses less memory, and significantly outperforms gradient-based TTA methods.
arXiv Detail & Related papers (2026-02-08T07:37:37Z) - AsynDBT: Asynchronous Distributed Bilevel Tuning for efficient In-Context Learning with Large Language Models [4.4866154758274375]
In-context learning (ICL) has emerged as a promising paradigm that enables LLMs to adapt to new tasks using examples provided within the input. Previous FL approaches that incorporate ICL have struggled with severe straggler problems and challenges associated with heterogeneous, non-identically distributed data. We propose an asynchronous distributed bilevel tuning (AsynDBT) algorithm that optimizes both in-context learning samples and prompt fragments based on feedback from the LLM.
arXiv Detail & Related papers (2026-02-06T13:07:49Z) - AVERY: Adaptive VLM Split Computing through Embodied Self-Awareness for Efficient Disaster Response Systems [6.294240680169978]
Unmanned Aerial Vehicles (UAVs) in disaster response require complex, queryable intelligence that on-board CNNs cannot provide. We present AVERY, a framework that enables VLM deployment through adaptive split computing.
arXiv Detail & Related papers (2025-11-22T18:42:04Z) - Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization [61.55616421408666]
Low-Altitude Economy Networks (LAENets) have enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. Onboard vision-language models (VLMs) enable real-time inference but are constrained by limited onboard resources and dynamic network conditions. We propose a UAV-enabled LAENet system that improves communication efficiency under dynamic LAENet conditions.
arXiv Detail & Related papers (2025-10-11T05:11:21Z) - Heterogeneous Multi-agent Collaboration in UAV-assisted Mobile Crowdsensing Networks [6.226837215382989]
Unmanned aerial vehicle (UAV)-assisted mobile crowdsensing (MCS) has emerged as a promising paradigm for data collection. We tackle challenges such as spectrum scarcity, device computation, and user mobility issues that hinder efficient coordination of sensing, communication, and resource allocation.
arXiv Detail & Related papers (2025-09-28T02:13:19Z) - Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection [9.198326035948613]
This paper proposes an adaptive guidance-based semantic enhancement edge-cloud collaborative object detection method. It can reduce latency by over 79% and computational cost by 70% in low-light and highly occluded scenes.
arXiv Detail & Related papers (2025-09-24T08:25:37Z) - Towards Efficient General Feature Prediction in Masked Skeleton Modeling [59.46799426434277]
We propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations.
arXiv Detail & Related papers (2025-09-03T18:05:02Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - Cloud-Device Collaborative Learning for Multimodal Large Language Models [24.65882336700547]
We introduce a Cloud-Device Collaborative Continual Adaptation framework to enhance the performance of compressed, device-deployed MLLMs.
Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission, cloud-based knowledge adaptation, and an optimized cloud-to-device downlink for model deployment.
arXiv Detail & Related papers (2023-12-26T18:46:14Z) - VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment [52.489874804051304]
VoLTA is a new vision-language pre-training paradigm that uses only image-caption data yet achieves fine-grained region-level image understanding.
VoLTA pushes multi-modal fusion deep into the uni-modal backbones during pre-training.
Experiments on a wide range of vision- and vision-language downstream tasks demonstrate the effectiveness of VoLTA.
arXiv Detail & Related papers (2022-10-09T01:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.