VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
- URL: http://arxiv.org/abs/2510.04479v2
- Date: Fri, 10 Oct 2025 23:14:35 GMT
- Title: VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery
- Authors: Nonghai Zhang, Zeyu Zhang, Jiazi Wang, Yang Zhao, Hao Tang
- Abstract summary: We propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training.
- Score: 14.993425622341917
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding, demonstrating strong capabilities in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural-heritage domains such as 3D vase artifacts, existing models face severe data scarcity and lack sufficient domain knowledge. Due to the absence of targeted training data, current VLMs struggle to handle such culturally significant specialized tasks effectively. To address these challenges, we propose VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer pairs and establishing a complete data construction pipeline. We further develop the VaseVLM model, which improves performance on vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach: compared with the previous state of the art on VaseVQA-3D, we improve R@1 by 12.8% and lexical similarity by 6.6%, significantly improving the recognition and understanding of 3D vase artifacts and providing new technical pathways for digital heritage preservation research. Code: https://github.com/AIGeeksGroup/VaseVQA-3D. Website: https://aigeeksgroup.github.io/VaseVQA-3D.
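The headline numbers above are retrieval-style QA metrics (R@1 and lexical similarity). As a rough, hedged sketch only: the snippet below assumes R@1 counts a question as correct when the top-ranked candidate answer exactly matches the reference after simple normalization, and approximates lexical similarity with a token-level F1. The function names, normalization, and metric formulations are assumptions for illustration, not the paper's exact definitions.

```python
# Hypothetical evaluation sketch for the two metrics named in the abstract.
# Assumptions: R@1 = exact match of the top-ranked candidate answer against
# the reference; lexical similarity = token-level F1. The paper's actual
# metric definitions may differ.

def normalize(text: str) -> list[str]:
    """Lowercase and whitespace-tokenize an answer string."""
    return text.lower().split()

def recall_at_1(ranked_candidates: list[list[str]], references: list[str]) -> float:
    """Fraction of questions whose top-ranked candidate matches the reference."""
    hits = sum(
        1
        for cands, ref in zip(ranked_candidates, references)
        if cands and normalize(cands[0]) == normalize(ref)
    )
    return hits / len(references)

def lexical_similarity(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer (assumed proxy)."""
    pred, ref = normalize(prediction), normalize(reference)
    common = sum(min(pred.count(tok), ref.count(tok)) for tok in set(pred))
    if not pred or not ref or common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

# Toy usage: one question, two ranked candidate answers, one reference answer.
print(recall_at_1([["red-figure amphora", "black-figure kylix"]], ["red-figure amphora"]))  # 1.0
print(round(lexical_similarity("an Attic red-figure amphora", "red-figure amphora"), 3))    # 0.667
```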
Related papers
- Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds [57.024495128182195]
We conduct a pilot study across different observation spaces and visual representations. Results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. We propose Any3D-VLA to address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases.
arXiv Detail & Related papers (2026-01-31T16:34:52Z) - GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation [31.365285503503475]
We present a framework for learning spatial reasoning using 2D boxes from standard detectors. We show that, when trained on GRAID data, models learn spatial reasoning concepts that generalize to held-out question types. We also show that models trained on all question types achieve improvements on several existing benchmarks.
arXiv Detail & Related papers (2025-10-25T02:07:23Z) - 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding [11.069512983766783]
Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks. We propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks.
arXiv Detail & Related papers (2025-07-31T11:59:06Z) - TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment [14.535056813802527]
Testing-time Distribution Alignment (TeDA) is a novel framework that adapts the pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings. Experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-05-05T02:47:07Z) - Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis [65.42684641776931]
3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models. We propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks.
arXiv Detail & Related papers (2025-03-28T13:32:29Z) - MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling [0.0]
Fine-tuning large generative models is a promising approach for making these models available in fields like engineering. We present MeshFleet, a filtered and annotated 3D dataset extracted from Objaverse-XL, the most extensive publicly available collection of 3D objects. We demonstrate the efficacy of our filtering method through a comparative analysis against caption and image aesthetic score-based techniques and fine-tuning experiments with SV3D.
arXiv Detail & Related papers (2025-03-18T08:09:24Z) - UVRM: A Scalable 3D Reconstruction Model from Unposed Videos [68.34221167200259]
Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose.
arXiv Detail & Related papers (2025-01-16T08:00:17Z) - Open-Vocabulary High-Resolution 3D (OVHR3D) Data Segmentation and Annotation Framework [1.1280113914145702]
This research aims to design and develop a comprehensive and efficient framework for 3D segmentation tasks. The framework integrates Grounding DINO and the Segment Anything Model, augmented by an enhancement in 2D image rendering via 3D mesh.
arXiv Detail & Related papers (2024-12-09T07:39:39Z) - Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset requiring thousands of GPU training days to facilitate research and development in this field.
We showcase two immediate benefits it enables: (1) learning token locations for transformer models; (2) directly regressing the 3D camera poses of 2D images with respect to NeRF models.
This in turn leads to improved performance on all three tasks of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z) - SketchANIMAR: Sketch-based 3D Animal Fine-Grained Retrieval [17.286320102183502]
We introduce a novel SHREC challenge track that focuses on retrieving relevant 3D animal models from a dataset using sketch queries.
Our contest requires participants to retrieve 3D models based on complex and detailed sketches.
We receive satisfactory results from eight teams and 204 runs.
arXiv Detail & Related papers (2023-04-12T09:40:38Z) - Objaverse: A Universe of Annotated 3D Objects [53.2537614157313]
We present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive tags, captions and animations.
We demonstrate the large potential of Objaverse 3D models via four applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models.
arXiv Detail & Related papers (2022-12-15T18:56:53Z) - Unsupervised Learning of 3D Object Categories from Videos in the Wild [75.09720013151247]
We focus on learning a model from multiple views of a large collection of object instances.
We propose a new neural network design, called warp-conditioned ray embedding (WCR), which significantly improves reconstruction.
Our evaluation demonstrates performance improvements over several deep monocular reconstruction baselines on existing benchmarks.
arXiv Detail & Related papers (2021-03-30T17:57:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.