AVA: Towards Autonomous Visualization Agents through Visual
Perception-Driven Decision-Making
- URL: http://arxiv.org/abs/2312.04494v1
- Date: Thu, 7 Dec 2023 18:13:42 GMT
- Authors: Shusen Liu, Haichao Miao, Zhimin Li, Matthew Olson, Valerio Pascucci,
Peer-Timo Bremer
- Abstract summary: We develop Autonomous Visualization Agents (AVAs) that can interpret and accomplish user-defined visualization objectives through natural language.
The addition of visual perception allows AVAs to act as virtual visualization assistants for domain experts who may lack the knowledge or expertise to fine-tune visualization outputs.
Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high-level visualization goals.
- Score: 19.09644604789813
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With recent advances in multi-modal foundation models, the previously
text-only large language models (LLM) have evolved to incorporate visual input,
opening up unprecedented opportunities for various applications in
visualization. Our work explores the utilization of the visual perception
ability of multi-modal LLMs to develop Autonomous Visualization Agents (AVAs)
that can interpret and accomplish user-defined visualization objectives through
natural language. We propose the first framework for the design of AVAs and
present several usage scenarios intended to demonstrate the general
applicability of the proposed paradigm. The addition of visual perception
allows AVAs to act as virtual visualization assistants for domain experts
who may lack the knowledge or expertise to fine-tune visualization outputs.
Our preliminary exploration and proof-of-concept agents suggest that this
approach can be widely applicable whenever the choices of appropriate
visualization parameters require the interpretation of previous visual output.
Feedback from unstructured interviews with experts in AI research, medical
visualization, and radiology has been incorporated, highlighting the
practicality and potential of AVAs. Our study indicates that AVAs represent a
general paradigm for designing intelligent visualization systems that can
achieve high-level visualization goals, paving the way for expert-level
visualization agents in the future.
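The paradigm described above, in which visualization parameters are chosen by interpreting previous visual output, can be sketched as a simple render-perceive-adjust loop. This is a minimal illustration, not the paper's implementation: the `render` and `perceive` functions are hypothetical stand-ins, and `perceive` replaces the multi-modal LLM critique with a fixed numeric rule so the loop is self-contained and runnable.

```python
def render(params):
    """Stand-in renderer: returns a fake 'image' summarizing the parameters.
    In an actual AVA this would produce a rendered visualization."""
    return {"brightness": params["brightness"]}

def perceive(image, objective):
    """Stub for the multi-modal LLM's visual critique. Here a fixed rule
    nudges brightness toward the objective; a real agent would interpret
    the rendered image against a natural-language objective."""
    delta = objective["target_brightness"] - image["brightness"]
    if abs(delta) < 0.05:
        return None  # objective satisfied, stop iterating
    return {"brightness": delta * 0.5}  # suggested parameter adjustment

def ava_loop(params, objective, max_iters=20):
    """Iteratively render, perceive, and adjust until the objective is met."""
    for _ in range(max_iters):
        image = render(params)
        adjustment = perceive(image, objective)
        if adjustment is None:
            break
        for key, delta in adjustment.items():
            params[key] += delta
    return params

final = ava_loop({"brightness": 0.2}, {"target_brightness": 0.8})
print(final["brightness"])
```

The key property the abstract highlights is captured by the loop's structure: each parameter choice depends on perception of the previous visual output, not on a fixed script.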
Related papers
- Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models [24.579822095003685]
We conduct an empirical study on representation learning for downstream Visual Question Answering (VQA).
We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches.
arXiv Detail & Related papers (2024-07-22T12:26:08Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding [12.082379948480257]
This paper proposes InsightSee, a multi-agent framework to enhance vision-language models' capabilities in handling complex visual understanding scenarios.
The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation.
The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.
arXiv Detail & Related papers (2024-05-31T13:56:55Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Question Aware Vision Transformer for Multimodal Reasoning [14.188369270753347]
We introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning.
It embeds question awareness directly within the vision encoder.
This integration results in dynamic visual features focusing on relevant image aspects to the posed question.
arXiv Detail & Related papers (2024-02-08T08:03:39Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation [26.039045505150526]
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding.
We present a new comprehensive benchmark, General Visual Understanding Evaluation, covering the full spectrum of visual cognitive abilities.
arXiv Detail & Related papers (2022-11-28T15:06:07Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Modeling Explicit Concerning States for Reinforcement Learning in Visual Dialogue [43.42833961578857]
We propose Explicit Concerning States (ECS) to represent what visual contents are concerned at each round and what have been concerned throughout the Visual Dialogue.
ECS is modeled from multimodal information and is represented explicitly.
Based on ECS, we formulate two intuitive and interpretable rewards to encourage the Visual Dialogue agents to converse on diverse and informative visual information.
arXiv Detail & Related papers (2021-07-12T08:15:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.