Related papers: VizGenie: Toward Self-Refining, Domain-Aware Workflows for Next-Generation Scientific Visualization

VizGenie: Toward Self-Refining, Domain-Aware Workflows for Next-Generation Scientific Visualization

URL: http://arxiv.org/abs/2507.21124v1
Date: Fri, 18 Jul 2025 23:54:22 GMT
Title: VizGenie: Toward Self-Refining, Domain-Aware Workflows for Next-Generation Scientific Visualization
Authors: Ayan Biswas, Terece L. Turton, Nishath Rajiv Ranasinghe, Shawn Jones, Bradley Love, William Jones, Aric Hagberg, Han-Wei Shen, Nathan DeBardeleben, Earl Lawrence,
Abstract summary: VizGenie is a framework that advances scientific visualization through large language model (LLM)<n>A distinctive feature of VizGenie is its intuitive natural language interface, allowing users to issue high-level feature-based queries.
Score: 12.826592849136215
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present VizGenie, a self-improving, agentic framework that advances scientific visualization through large language model (LLM) by orchestrating of a collection of domain-specific and dynamically generated modules. Users initially access core functionalities--such as threshold-based filtering, slice extraction, and statistical analysis--through pre-existing tools. For tasks beyond this baseline, VizGenie autonomously employs LLMs to generate new visualization scripts (e.g., VTK Python code), expanding its capabilities on-demand. Each generated script undergoes automated backend validation and is seamlessly integrated upon successful testing, continuously enhancing the system's adaptability and robustness. A distinctive feature of VizGenie is its intuitive natural language interface, allowing users to issue high-level feature-based queries (e.g., ``visualize the skull"). The system leverages image-based analysis and visual question answering (VQA) via fine-tuned vision models to interpret these queries precisely, bridging domain expertise and technical implementation. Additionally, users can interactively query generated visualizations through VQA, facilitating deeper exploration. Reliability and reproducibility are further strengthened by Retrieval-Augmented Generation (RAG), providing context-driven responses while maintaining comprehensive provenance records. Evaluations on complex volumetric datasets demonstrate significant reductions in cognitive overhead for iterative visualization tasks. By integrating curated domain-specific tools with LLM-driven flexibility, VizGenie not only accelerates insight generation but also establishes a sustainable, continuously evolving visualization practice. The resulting platform dynamically learns from user interactions, consistently enhancing support for feature-centric exploration and reproducible research in scientific visualization.

Related papers

Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook [85.43403500874889]
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI)<n>Recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains.
arXiv Detail & Related papers (2025-03-23T10:33:28Z)
InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions [22.007942964950217]
We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs.<n>This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses.
arXiv Detail & Related papers (2025-03-06T05:35:19Z)
Exploring the Potential of Large Language Models as Predictors in Dynamic Text-Attributed Graphs [23.655368505970443]
We pioneer using large language models (LLMs) for predictive tasks on dynamic graphs.<n>We propose the GraphAgent-Dynamic (GAD) Framework, a multi-agent system that leverages collaborative LLMs.<n>GAD incorporates global and local summary agents to generate domain-specific knowledge, enhancing its transferability across domains.
arXiv Detail & Related papers (2025-03-05T08:28:11Z)
Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks [0.0]
Vision-Driven Prompt Optimization (VDPO) generates textual prompts from visual inputs, guiding high-fidelity image synthesis.<n>VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores.<n>Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.
arXiv Detail & Related papers (2025-01-05T13:01:47Z)
Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.<n>Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.<n>We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL)<n>Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.<n>These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent [8.212818176634116]
We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network. Our approach introduces camera ID-separators to improve multi-view processing, crucial for comprehensive environmental awareness.
arXiv Detail & Related papers (2024-11-08T15:50:30Z)
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.<n>Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.<n>We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of CL and MIM. X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders. It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z)
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement [102.22911097049953]
Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks.<n>Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results.<n>We propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies.
arXiv Detail & Related papers (2024-05-24T23:09:27Z)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.<n>Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.<n>We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.