An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution
- URL: http://arxiv.org/abs/2601.06235v1
- Date: Fri, 09 Jan 2026 15:13:32 GMT
- Title: An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution
- Authors: Sheng-Kai Chen, Jyh-Horng Wu, Ching-Yao Lin, Yen-Ting Lin,
- Abstract summary: The system employs dual-agent architecture where Agent 01 handles Automatic Speech Recognition (ASR) and Agent 02 manages AI processing through local Large Language Models (LLMs), Model Context Protocol (MCP) tools, and Retrieval-Augmented Generation (RAG)<n>The system supports real-time RTSP streaming for voice and video data transmission, eye tracking data collection, and remote task execution through RabbitMQ messaging.
- Score: 4.740337082971588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an AI glasses system that integrates real-time voice processing, artificial intelligence(AI) agents, and cross-network streaming capabilities. The system employs dual-agent architecture where Agent 01 handles Automatic Speech Recognition (ASR) and Agent 02 manages AI processing through local Large Language Models (LLMs), Model Context Protocol (MCP) tools, and Retrieval-Augmented Generation (RAG). The system supports real-time RTSP streaming for voice and video data transmission, eye tracking data collection, and remote task execution through RabbitMQ messaging. Implementation demonstrates successful voice command processing with multilingual support and cross-platform task execution capabilities.
Related papers
- RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform [49.43594274832262]
We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems.<n>RepoLaunch automates the remaining steps, enabling scalable benchmarking and training of coding agents and LLMs.
arXiv Detail & Related papers (2026-03-05T10:15:13Z) - Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage [66.67531241554546]
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines.<n>We introduce the first approach to extend tool use directly into speech-in speech-out systems.<n>We propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech.
arXiv Detail & Related papers (2025-10-02T14:18:20Z) - Scaling Synthetic Task Generation for Agents via Exploration [67.70129766322985]
Post-Training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web navigation, and robotics.<n>Existing approaches for task generation rely heavily on human annotation or prompting MLLM with limited downstream environment information.<n>We present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to synthesize environment-grounded tasks.
arXiv Detail & Related papers (2025-09-29T17:00:02Z) - i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents [0.42970700836450487]
We analyze components essential to voice to voice (V-2-V) system viz. automatic speech recognition (ASR), text-to-speech (TTS), and dialog management.<n>Our work identifies that TTS component which generates life-like voice, full of emotions including natural pauses and exclamations has highest impact on Real time factor (RTF)
arXiv Detail & Related papers (2025-09-25T10:15:51Z) - Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition [17.451829471077858]
Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments.<n> Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods.<n>We propose the first collaborative multi-agent system for ACSR, named Cued-Agent.
arXiv Detail & Related papers (2025-08-01T07:40:39Z) - AgentMaster: A Multi-Agent Conversational Framework Using A2A and MCP Protocols for Multimodal Information Retrieval and Analysis [0.0]
We present a pilot study of AgentMaster, a novel modular multi-protocol MAS framework with self-implemented A2A and MCP.<n>The system supports natural language interaction without prior technical expertise and responds to multimodal queries for tasks including information retrieval, question answering, and image analysis.<n>Overall, our proposed framework contributes to the potential capabilities of domain-specific, cooperative, and scalable conversational AI powered by MAS.
arXiv Detail & Related papers (2025-07-08T03:34:26Z) - Asynchronous Tool Usage for Real-Time Agents [61.3041983544042]
We introduce asynchronous AI agents capable of parallel processing and real-time tool-use.
Our key contribution is an event-driven finite-state machine architecture for agent execution and prompting.
This work presents both a conceptual framework and practical tools for creating AI agents capable of fluid, multitasking interactions.
arXiv Detail & Related papers (2024-10-28T23:57:19Z) - RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent [15.836845304125436]
RS-Agent is an AI agent designed to interact with human users and autonomously leverage specialized models.<n> RS-Agent integrates four key components: a Central Controller based on large language models, a dynamic toolkit for tool execution, a Solution Space for task-specific expert guidance, and a Knowledge Space for domain-level reasoning.<n>Extensive experiments across 9 datasets and 18 remote sensing tasks demonstrate that RS-Agent significantly outperforms state-of-the-art MLLMs.
arXiv Detail & Related papers (2024-06-11T09:30:02Z) - SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering [79.07755560048388]
SWE-agent is a system that facilitates LM agents to autonomously use computers to solve software engineering tasks.
SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs.
We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively.
arXiv Detail & Related papers (2024-05-06T17:41:33Z) - SpeechAgents: Human-Communication Simulation with Multi-Modal
Multi-Agent Systems [53.94772445896213]
Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society.
We propose SpeechAgents, a multi-modal LLM based multi-agent system designed for simulating human communication.
arXiv Detail & Related papers (2024-01-08T15:01:08Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.