MISAR: A Multimodal Instructional System with Augmented Reality
- URL: http://arxiv.org/abs/2310.11699v1
- Date: Wed, 18 Oct 2023 04:15:12 GMT
- Title: MISAR: A Multimodal Instructional System with Augmented Reality
- Authors: Jing Bi, Nguyen Manh Nguyen, Ali Vosoughi, Chenliang Xu
- Abstract summary: Augmented reality (AR) requires seamless integration of visual, auditory, and linguistic channels for optimized human-computer interaction.
Our study introduces an innovative method harnessing large language models (LLMs) to assimilate information from visual, auditory, and contextual modalities.
- Score: 38.79160527414268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Augmented reality (AR) requires the seamless integration of visual, auditory,
and linguistic channels for optimized human-computer interaction. While
auditory and visual inputs facilitate real-time and contextual user guidance,
the potential of large language models (LLMs) in this landscape remains largely
untapped. Our study introduces an innovative method harnessing LLMs to
assimilate information from visual, auditory, and contextual modalities.
Focusing on the unique challenge of task performance quantification in AR, we
utilize egocentric video, speech, and context analysis. The integration of LLMs
facilitates enhanced state estimation, marking a step towards more adaptive AR
systems. Code, dataset, and demo will be available at
https://github.com/nguyennm1024/misar.
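To make the fusion idea concrete, the sketch below is a minimal illustration, not code from the paper or its repository: the names (Observation, build_state_prompt, estimate_state, query_llm, the example task steps) and the prompt format are all assumptions. It shows one plausible way to flatten egocentric-frame captions, an ASR transcript, and task context into a single text prompt so that an LLM can estimate the current task state.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Observation:
    """One time step of multimodal input (hypothetical schema)."""
    frame_caption: str   # caption derived from the egocentric video frame
    speech: str          # ASR transcript of the user's speech
    context: str         # task/environment context, e.g. the active instruction step

# Example task steps; purely illustrative, not from the MISAR dataset.
TASK_STEPS = ["gather ingredients", "mix batter", "bake", "serve"]

def build_state_prompt(history: List[Observation]) -> str:
    """Serialize visual, auditory, and contextual signals into one LLM prompt."""
    lines = ["You are an AR assistant. Estimate which task step the user is on."]
    lines.append("Possible steps: " + ", ".join(TASK_STEPS))
    for t, obs in enumerate(history):
        lines.append(f"[t={t}] vision: {obs.frame_caption}")
        lines.append(f"[t={t}] speech: {obs.speech}")
        lines.append(f"[t={t}] context: {obs.context}")
    lines.append("Answer with the single most likely current step.")
    return "\n".join(lines)

def estimate_state(history: List[Observation], query_llm: Callable[[str], str]) -> str:
    """query_llm is any text-in/text-out LLM call supplied by the caller."""
    return query_llm(build_state_prompt(history)).strip()
```

Here query_llm stands in for whatever LLM endpoint is available; the only point of the sketch is the general pattern the abstract describes, namely that the visual, auditory, and contextual channels are assimilated into a shared representation before state estimation.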
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- LLM-Assisted Visual Analytics: Opportunities and Challenges [4.851427485686741]
We explore the integration of large language models (LLMs) into visual analytics (VA) systems.
We highlight the new possibilities that LLMs bring to VA, especially how they can change VA processes beyond the usual use cases.
We carefully consider the prominent challenges of using current LLMs in VA tasks.
arXiv Detail & Related papers (2024-09-04T13:24:03Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- IM-RAG: Multi-Round Retrieval-Augmented Generation Through Learning Inner Monologues [10.280113107290067]
The IM-RAG approach integrates Information Retrieval systems with Large Language Models (LLMs) to support multi-round RAG.
The entire IM process is optimized via Reinforcement Learning (RL) where a Progress Tracker is incorporated to provide mid-step rewards.
The results show that our approach achieves state-of-the-art (SOTA) performance while providing high flexibility in integrating IR modules.
arXiv Detail & Related papers (2024-05-15T12:41:20Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)
- Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)