AVA: Towards Autonomous Visualization Agents through Visual
Perception-Driven Decision-Making
- URL: http://arxiv.org/abs/2312.04494v1
- Date: Thu, 7 Dec 2023 18:13:42 GMT
- Title: AVA: Towards Autonomous Visualization Agents through Visual
Perception-Driven Decision-Making
- Authors: Shusen Liu, Haichao Miao, Zhimin Li, Matthew Olson, Valerio Pascucci,
Peer-Timo Bremer
- Abstract summary: We develop Autonomous Visualization Agents (AVAs) that can interpret and accomplish user-defined visualization objectives through natural language.
The addition of visual perception allows AVAs to act as virtual visualization assistants for domain experts who may lack the knowledge or expertise to fine-tune visualization outputs.
Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high-level visualization goals.
- Score: 19.09644604789813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With recent advances in multi-modal foundation models, the previously
text-only large language models (LLMs) have evolved to incorporate visual input,
opening up unprecedented opportunities for various applications in
visualization. Our work explores the utilization of the visual perception
ability of multi-modal LLMs to develop Autonomous Visualization Agents (AVAs)
that can interpret and accomplish user-defined visualization objectives through
natural language. We propose the first framework for the design of AVAs and
present several usage scenarios intended to demonstrate the general
applicability of the proposed paradigm. The addition of visual perception
allows AVAs to act as virtual visualization assistants for domain experts
who may lack the knowledge or expertise to fine-tune visualization outputs.
Our preliminary exploration and proof-of-concept agents suggest that this
approach can be widely applicable whenever the choices of appropriate
visualization parameters require the interpretation of previous visual output.
Feedback from unstructured interviews with experts in AI research, medical
visualization, and radiology has been incorporated, highlighting the
practicality and potential of AVAs. Our study indicates that AVAs represent a
general paradigm for designing intelligent visualization systems that can
achieve high-level visualization goals, paving the way for expert-level
visualization agents in the future.
Related papers
- Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations [15.052986179046076]
We introduce MedVP, a pioneering framework that integrates medical entity extraction, visual prompt generation, and dataset adaptation for visual-prompt-guided fine-tuning.
We outperform recent state-of-the-art large models across multiple medical VQA datasets.
arXiv Detail & Related papers (2025-01-04T21:23:36Z)
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs).
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts approach for audiovisual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection (a minimal sketch of such a projection appears after this list).
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding [12.082379948480257]
This paper proposes InsightSee, a multi-agent framework to enhance vision-language models' capabilities in handling complex visual understanding scenarios.
The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation.
The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.
arXiv Detail & Related papers (2024-05-31T13:56:55Z)
- Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning.
It embeds question awareness directly within the vision encoder.
This integration results in dynamic visual features that focus on the image aspects relevant to the posed question.
arXiv Detail & Related papers (2024-02-08T08:03:39Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation [26.039045505150526]
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
We present a new comprehensive benchmark, General Visual Understanding Evaluation, covering the full spectrum of visual cognitive abilities.
arXiv Detail & Related papers (2022-11-28T15:06:07Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative, task-specific visual features by jointly using a vision-specific contrastive loss, a cross-modal contrastive loss, and implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.