GQVis: A Dataset of Genomics Data Questions and Visualizations for Generative AI
- URL: http://arxiv.org/abs/2510.13816v1
- Date: Fri, 19 Sep 2025 21:29:13 GMT
- Title: GQVis: A Dataset of Genomics Data Questions and Visualizations for Generative AI
- Authors: Skylar Sargent Walters, Arthea Valderrama, Thomas C. Smits, David Kouřil, Huyen N. Nguyen, Sehi L'Yi, Devin Lange, Nils Gehlenborg
- Abstract summary: We present a framework for generating a dataset that pairs abstract, low-level questions about genomics data with corresponding visualizations. We use genomics data retrieved from three distinct genomics data repositories (4DN, ENCODE, Chromoscope) to produce GQVis.
- Score: 8.6230004297124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data visualization is a fundamental tool in genomics research, enabling the exploration, interpretation, and communication of complex genomic features. While machine learning models show promise for transforming data into insightful visualizations, current models lack the training foundation for domain-specific tasks. In an effort to provide a foundational resource for genomics-focused model training, we present a framework for generating a dataset that pairs abstract, low-level questions about genomics data with corresponding visualizations. Building on prior work with statistical plots, our approach adapts to the complexity of genomics data and the specialized representations used to depict them. We further incorporate multiple linked queries and visualizations, along with justifications for design choices, figure captions, and image alt-texts for each item in the dataset. We use genomics data retrieved from three distinct genomics data repositories (4DN, ENCODE, Chromoscope) to produce GQVis: a dataset consisting of 1.14 million single-query data points, 628k query pairs, and 589k query chains. The GQVis dataset and generation code are available at https://huggingface.co/datasets/HIDIVE/GQVis and https://github.com/hms-dbmi/GQVis-Generation.
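To make the pairing concrete, the sketch below shows what one single-query GQVis data point might look like as a Python record. All field names here are assumptions for illustration only; consult the dataset card at https://huggingface.co/datasets/HIDIVE/GQVis for the actual schema.

```python
# Hypothetical sketch of a single-query GQVis-style record.
# Field names are illustrative assumptions, not the dataset's real schema.

def make_record(question, spec, caption, alt_text, justification):
    """Pair a low-level genomics question with its visualization metadata."""
    return {
        "question": question,
        "visualization_spec": spec,       # e.g. a declarative chart spec
        "caption": caption,               # figure caption, per the abstract
        "alt_text": alt_text,             # image alt-text, per the abstract
        "design_justification": justification,  # rationale for design choices
    }

record = make_record(
    question="What is the read coverage in chr1:1000000-2000000?",
    spec={"mark": "bar", "x": {"field": "position"}, "y": {"field": "coverage"}},
    caption="Read coverage across a 1 Mb window of chromosome 1.",
    alt_text="Bar chart of read coverage along chr1 positions 1,000,000-2,000,000.",
    justification="Bar marks emphasize per-bin coverage magnitude.",
)
```

Query pairs and query chains would then link two or more such records, with the chained questions building on earlier answers.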
Related papers
- Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation [64.23194519770897]
We build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions. We propose an influence function based data selection and difficulty based filtering strategy to identify high-quality training samples from this dataset. We train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities.
arXiv Detail & Related papers (2025-08-18T07:24:33Z) - Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-Trees [50.78679002846741]
We propose a novel approach to cross-task generalization in graphs via task-trees. We show that pretraining a graph neural network (GNN) on diverse task-trees with a reconstruction objective induces transferable knowledge. This enables efficient adaptation to downstream tasks with minimal fine-tuning.
arXiv Detail & Related papers (2024-12-21T02:07:43Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models [3.7685718201378746]
This research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery.
The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources.
arXiv Detail & Related papers (2024-10-05T17:11:37Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - A Scalable Tool For Analyzing Genomic Variants Of Humans Using Knowledge Graphs and Machine Learning [7.928994572633366]
We present a comprehensive approach for leveraging knowledge graphs and graph machine learning to analyze genomic variants.
The proposed method involves extracting variant-level genetic information, annotating the data with additional metadata using SnpEff, and converting the enriched Variant Call Format files into Resource Description Framework triples.
The resulting knowledge graph is further enhanced with patient metadata and stored in a graph database, facilitating efficient querying and indexing.
arXiv Detail & Related papers (2024-07-30T14:56:10Z) - Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - VisionKG: Unleashing the Power of Visual Datasets via Knowledge Graph [2.3143591448419074]
Vision Knowledge Graph (VisionKG) is a novel resource that interlinks, organizes and manages visual datasets via knowledge graphs and Semantic Web technologies.
VisionKG currently contains 519 million RDF triples that describe approximately 40 million entities.
arXiv Detail & Related papers (2023-09-24T11:19:13Z) - Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation [32.83012699501051]
We improve generative data augmentation by formulating the data generation as context generation task.
We cast downstream tasks into question answering format and adapt the fine-tuned context generators to the target task domain.
We demonstrate substantial performance improvements in few-shot and zero-shot settings.
arXiv Detail & Related papers (2022-05-25T09:28:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.