Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems
- URL: http://arxiv.org/abs/2601.15676v1
- Date: Thu, 22 Jan 2026 05:57:25 GMT
- Title: Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems
- Authors: Hengfan Zhang, Yueqian Lin, Hai Helen Li, Yiran Chen,
- Abstract summary: CoFi-Agent is a hybrid architecture targeting edge servers and gateways.<n>It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected.<n>On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline.
- Score: 10.143590597259792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception - generic summaries that miss the subtle evidence required for multi-step audio reasoning - while indiscriminate cloud offloading incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways. It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent runs an initial single-pass on a local 7B Audio-LLM, then a cloud controller gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline. Overall, CoFi-Agent bridges the perception gap via tool-enabled, conditional edge-cloud collaboration under practical system constraints.
Related papers
- AmbShield: Enhancing Physical Layer Security with Ambient Backscatter Devices against Eavesdroppers [69.56534335936534]
AmbShield is an AmBD-assisted PLS scheme that leverages naturally distributed AmBDs to simultaneously strengthen the legitimate channel and degrade eavesdroppers'<n>In AmbShield, AmBDs are exploited as friendly jammers that randomly backscatter to create interference at eavesdroppers, and as passive relays that backscatter the desired signal to enhance the capacity of legitimate devices.
arXiv Detail & Related papers (2026-01-14T20:56:50Z) - Efficient Mixture-of-Agents Serving via Tree-Structured Routing, Adaptive Pruning, and Dependency-Aware Prefill-Decode Overlap [15.352230356342366]
Mixture-of-Agents (MoA) inference can suffer from dense inter-agent communication and low hardware utilization.<n>We present a serving design that targets these bottlenecks through an algorithm-system co-design.
arXiv Detail & Related papers (2025-12-19T23:06:58Z) - Information-Dense Reasoning for Efficient and Auditable Security Alert Triage [5.3761282240937796]
Security Operations Centers face massive, heterogeneous alert streams under minute-level service windows.<n>Existing solutions fail: signature systems are brittle, anomaly methods lack actionability, and fully cloud-hosted LLMs raise latency, cost, and privacy concerns.<n>We propose AIDR, a hybrid cloud-premises framework that addresses this trade-off through constrained information-specialized optimization.
arXiv Detail & Related papers (2025-12-09T01:57:24Z) - Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training.<n>We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z) - CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation [0.0]
CoSense-LLM is an edge-first framework that turns continuous multimodal sensor streams into compact semantic tokens.<n>System works with modern serving optimizations, including paged or streaming KV caches, Flash style kernels, speculative decoding, and quantized LoRA adapters.
arXiv Detail & Related papers (2025-10-22T15:16:56Z) - Adaptive Learning for IRS-Assisted Wireless Networks: Securing Opportunistic Communications Against Byzantine Eavesdroppers [7.256056777973974]
We propose a joint learning framework for Byzantine-resilient spectrum sensing and secure intelligent reflecting surface (IRS)<n>We develop an augmented-Lagrangian alternating algorithm with projected updates and provide provable sublinear convergence, with accelerated rates under mild local curvature.<n> Simulations across diverse network conditions show higher detection probability at fixed false-alarm rate under adversarial attacks, large reductions in sum MSE for honest users, strong suppression of eavesdropper signal power, and fast convergence.
arXiv Detail & Related papers (2025-08-11T17:28:25Z) - Multi-agent Auditory Scene Analysis [0.0]
Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment, by carrying out three main tasks: sound source location, separation, and classification.<n>Running these tasks linearly increases the overall response time, while making the last tasks highly sensitive to errors of the first task (location)<n>A multi-agent approach is proposed to carry out ASA where the tasks are run in parallel, with feedback loops between them to compensate for local errors.
arXiv Detail & Related papers (2025-07-03T16:16:46Z) - PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection [51.20479454379662]
We propose a.
Federated Anomaly Detection framework named PeFAD with the increasing privacy concerns.
We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.
arXiv Detail & Related papers (2024-06-04T13:51:08Z) - Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for
5G and Beyond [70.81551587109833]
nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimension parameter model and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the emphexit point, emphexit point, and emphcompressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such an computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant.
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.