Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI
- URL: http://arxiv.org/abs/2603.01104v1
- Date: Sun, 01 Mar 2026 13:43:04 GMT
- Title: Egocentric Co-Pilot: Web-Native Smart-Glasses Agents for Assistive Egocentric AI
- Authors: Sicheng Yang, Yukai Huang, Weitong Cai, Shitong Sun, Fengyi Fang, You He, Yiqiao Xie, Jiankang Deng, Hang Zhang, Jifei Song, Zhensong Zhang,
- Abstract summary: We present Egocentric Co-Pilot, a web-native neuro-symbolic framework that runs on smart glasses. We use a Large Language Model (LLM) to orchestrate a toolbox of perception, reasoning, and web tools. Experiments on EgoLife and HD-EPIC demonstrate competitive or state-of-the-art egocentric QA performance.
- Score: 56.98603185789977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What if accessing the web did not require a screen, a stable desk, or even free hands? For people navigating crowded cities, living with low vision, or experiencing cognitive overload, smart glasses coupled with AI agents could turn the web into an always-on assistive layer over daily life. We present Egocentric Co-Pilot, a web-native neuro-symbolic framework that runs on smart glasses and uses a Large Language Model (LLM) to orchestrate a toolbox of perception, reasoning, and web tools. An egocentric reasoning core combines Temporal Chain-of-Thought with Hierarchical Context Compression to support long-horizon question answering and decision support over continuous first-person video, far beyond a single model's context window. Additionally, a lightweight multimodal intent layer maps noisy speech and gaze into structured commands. We further implement and evaluate a cloud-native WebRTC pipeline integrating streaming speech, video, and control messages into a unified channel for smart glasses and browsers. In parallel, we deploy an on-premise WebSocket baseline, exposing concrete trade-offs between local inference and cloud offloading in terms of latency, mobility, and resource use. Experiments on EgoLife and HD-EPIC demonstrate competitive or state-of-the-art egocentric QA performance, and a human-in-the-loop study on smart glasses shows higher task completion and user satisfaction than leading commercial baselines. Taken together, these results indicate that web-connected egocentric co-pilots can be a practical path toward more accessible, context-aware assistance in everyday life. By grounding operation in web-native communication primitives and modular, auditable tool use, Egocentric Co-Pilot offers a concrete blueprint for assistive, always-on web agents that support education, accessibility, and social inclusion for people who may benefit most from contextual, egocentric AI.
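The abstract names Hierarchical Context Compression but does not spell it out, so a minimal sketch may help convey the idea: recursively summarize per-segment descriptions of the first-person video until the running context fits a fixed token budget, then reason over the compressed context. Everything below is an illustrative assumption rather than the authors' implementation: the fixed-size grouping, the generic `summarize(text, max_tokens)` LLM callable, and the rough 4-characters-per-token estimate.

```python
# Minimal sketch of hierarchical context compression (illustrative only).
# Assumptions not taken from the paper: fixed-size chunk grouping, a
# generic summarize(text, max_tokens) LLM call, and a crude token estimate.
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)


def compress_hierarchically(
    segments: List[str],
    summarize: Callable[[str, int], str],
    budget_tokens: int = 4096,
    group_size: int = 8,
    summary_tokens: int = 128,
) -> str:
    """Recursively summarize groups of segment descriptions until the
    concatenated context fits within the model's token budget."""
    context = "\n".join(segments)
    while estimate_tokens(context) > budget_tokens and len(segments) > 1:
        # Collapse adjacent groups of segments into one shorter node each,
        # building the next level of the compression hierarchy.
        segments = [
            summarize("\n".join(segments[i:i + group_size]), summary_tokens)
            for i in range(0, len(segments), group_size)
        ]
        context = "\n".join(segments)
    return context
```

A Temporal Chain-of-Thought prompt could then be assembled over the returned context, asking the model to reason step by step across the (compressed) timeline before answering.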
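The multimodal intent layer can likewise be pictured as a small fusion step from a noisy ASR transcript plus the current gaze target to a structured command. The `Command` schema, the keyword table, and the confidence heuristic in this sketch are hypothetical placeholders, not the paper's actual interface.

```python
# Illustrative sketch of a multimodal intent layer: fuse a noisy ASR
# transcript with the current gaze target into a structured command.
# The schema (action / target / confidence) is an assumption.
from dataclasses import dataclass
from typing import Optional

ACTION_KEYWORDS = {"what": "describe", "where": "navigate", "read": "read"}


@dataclass
class Command:
    action: str              # e.g. "describe", "navigate", "read"
    target: Optional[str]    # object the user is gazing at, if any
    confidence: float        # fused confidence in [0, 1]


def fuse_intent(transcript: str, gaze_target: Optional[str],
                asr_confidence: float) -> Command:
    """Map noisy speech plus gaze into a structured command.

    Deictic references ("this", "that", "it") are resolved against the
    gaze target; unmatched utterances fall back to a generic describe.
    """
    words = transcript.lower().split()
    action = next((ACTION_KEYWORDS[w] for w in words if w in ACTION_KEYWORDS),
                  "describe")
    # Boost confidence when gaze disambiguates a deictic utterance.
    deictic = any(w in ("this", "that", "it") for w in words)
    boost = 0.2 if deictic and gaze_target else 0.0
    return Command(action=action, target=gaze_target,
                   confidence=min(1.0, asr_confidence + boost))


# Example: fuse_intent("what is this", "bus stop sign", 0.7)
# -> Command(action="describe", target="bus stop sign", confidence=0.9)
```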
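For the on-premise WebSocket baseline, a minimal server along the following lines would multiplex binary video frames and JSON control/speech messages over a single socket. The message framing and field names are assumptions, and `process_frame` / `run_agent` are stubs standing in for the perception and LLM-orchestration stages the paper describes.

```python
# Minimal sketch of an on-premise WebSocket baseline: one socket carries
# binary video frames and JSON control/speech events. Framing and field
# names are assumptions, not the paper's protocol.
import asyncio
import json

import websockets  # pip install websockets


async def process_frame(frame: bytes) -> None:
    pass  # hand off to the perception pipeline (placeholder)


async def run_agent(text: str) -> str:
    return f"(stub) received: {text}"  # placeholder for LLM orchestration


async def handler(ws):
    async for message in ws:
        if isinstance(message, bytes):
            # Binary payloads are treated as encoded video frames.
            await process_frame(message)
        else:
            # Text payloads carry control or transcribed-speech events.
            event = json.loads(message)
            if event.get("type") == "speech":
                reply = await run_agent(event["text"])
                await ws.send(json.dumps({"type": "answer", "text": reply}))


async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```

The cloud-native WebRTC pipeline replaces this single socket with media tracks plus a data channel, trading the simplicity above for lower-latency streaming and NAT traversal, which is the local-versus-cloud trade-off the abstract highlights.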
Related papers
- Agentic Very Long Video Understanding [39.34545320553102]
EGAgent is an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex long-video understanding tasks.
arXiv Detail & Related papers (2026-01-26T05:20:47Z) - EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT [56.24624833924252]
EgoThinker is a framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks.
arXiv Detail & Related papers (2025-10-27T17:38:17Z) - Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence [109.32705135051486]
Embodied Web Agents is a novel paradigm for AI agents that fluidly bridges physical embodiment and web-scale reasoning. We release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks. Results reveal significant performance gaps between state-of-the-art AI systems and human capabilities.
arXiv Detail & Related papers (2025-06-18T17:58:17Z) - EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z) - Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model [49.90916095152366]
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. We release the complete implementation for on-device development, together with a demo web platform for testing uploaded videos.
arXiv Detail & Related papers (2024-12-30T16:57:05Z) - Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z) - Embodied AI-Driven Operation of Smart Cities: A Concise Review [3.441021278275805]
Embodied AI focuses on learning through interaction with the surrounding environment.
We go through its definitions, characteristics, and current achievements, along with different algorithms, approaches, and solutions.
We then explore the available simulators and interactive 3D databases that make research in this area feasible.
arXiv Detail & Related papers (2021-08-22T19:14:59Z)