VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software
- URL: http://arxiv.org/abs/2505.24838v1
- Date: Fri, 30 May 2025 17:39:52 GMT
- Title: VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software
- Authors: Brandon Man, Ghadi Nehme, Md Ferdous Alam, Faez Ahmed,
- Abstract summary: VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations.<n>VideoCAD offers an order of magnitude higher complexity in UI interaction learning for real-world engineering tasks.<n>We show two important downstream applications of VideoCAD: learning UI interactions from professional precision 3D CAD tools and a visual question-answering benchmark.
- Score: 3.668843811005568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt at engineering UI interaction learning for precision tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order of magnitude higher complexity in UI interaction learning for real-world engineering tasks, having up to a 20x longer time horizon than other datasets. We show two important downstream applications of VideoCAD: learning UI interactions from professional precision 3D CAD tools and a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models' (LLM) spatial reasoning and video understanding abilities. To learn the UI interactions, we propose VideoCADFormer - a state-of-the-art model in learning CAD interactions directly from video, which outperforms multiple behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.
Related papers
- RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base [112.72361202480154]
We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base.<n> Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach.
arXiv Detail & Related papers (2025-06-23T17:19:41Z) - CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation [4.092348452904736]
This paper introduces CAD-Coder, an open-source Vision-Language Model (VLM) explicitly fine-tuned to generate editable CAD code (CadQuery Python) directly from visual input.<n>Leveraging a novel dataset that we created--GenCAD-Code, consisting of over 163k CAD-model image and code pairs--CAD-Coder outperforms state-of-the-art VLM baselines.
arXiv Detail & Related papers (2025-05-20T17:34:44Z) - UniCAD: Efficient and Extendable Architecture for Multi-Task Computer-Aided Diagnosis System [48.83716673786449]
We propose UniCAD, a unified architecture that seamlessly handles both 2D and 3D medical images.<n>A low-rank adaptation strategy is employed to adapt a pre-trained visual model to the medical image domain, achieving performance on par with fully fine-tuned counterparts.<n>Building on this unified CAD architecture, we establish an open-source platform where researchers can share and access lightweight CAD experts.
arXiv Detail & Related papers (2025-05-14T06:21:27Z) - CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images [69.7768227804928]
CADCrafter is an image-to-parametric CAD model generation framework that trains solely on synthetic textureless CAD data.<n>We introduce a geometry encoder to accurately capture diverse geometric features.<n>Our approach can robustly handle real unconstrained CAD images, and even generalize to unseen general objects.
arXiv Detail & Related papers (2025-04-07T06:01:35Z) - Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization [12.12975824816803]
Reverse engineering 3D computer-aided design (CAD) models from images is an important task for many downstream applications.
In this work, we introduce a novel approach that conditionally factorizes the task into two sub-problems.
We propose TrAssembler that conditioned on the discrete structure with semantics predicts the continuous attribute values.
arXiv Detail & Related papers (2024-07-19T06:53:30Z) - OpenECAD: An Efficient Visual Language Model for Editable 3D-CAD Design [1.481550828146527]
We fine-tuned pre-trained models to create OpenECAD models (0.55B, 0.89B, 2.4B and 3.1B)
OpenECAD models can process images of 3D designs as input and generate highly structured 2D sketches and 3D construction commands.
These outputs can be directly used with existing CAD tools' APIs to generate project files.
arXiv Detail & Related papers (2024-06-14T10:47:52Z) - Model2Scene: Learning 3D Scene Representation via Contrastive
Language-CAD Models Pre-training [105.3421541518582]
Current successful methods of 3D scene perception rely on the large-scale annotated point cloud.
We propose Model2Scene, a novel paradigm that learns free 3D scene representation from Computer-Aided Design (CAD) models and languages.
Model2Scene yields impressive label-free 3D object salient detection with an average mAP of 46.08% and 55.49% on the ScanNet and S3DIS datasets, respectively.
arXiv Detail & Related papers (2023-09-29T03:51:26Z) - CAD-Estate: Large-scale CAD Model Annotation in RGB Videos [34.63782303927944]
We propose a method for annotating videos of complex multi-object scenes with a globally-consistent 3D representation of the objects.
We annotate each object with a CAD model from a database, and place it in the 3D coordinate frame of the scene with a 9-DoF pose transformation.
Our method is semi-automatic and works on commonly-available RGB videos, without requiring a depth sensor.
arXiv Detail & Related papers (2023-06-15T10:12:02Z) - Multiview Compressive Coding for 3D Reconstruction [77.95706553743626]
We introduce a simple framework that operates on 3D points of single objects or whole scenes.
Our model, Multiview Compressive Coding, learns to compress the input appearance and geometry to predict the 3D structure.
arXiv Detail & Related papers (2023-01-19T18:59:52Z) - AutoCAD: Automatically Generating Counterfactuals for Mitigating
Shortcut Learning [70.70393006697383]
We present AutoCAD, a fully automatic and task-agnostic CAD generation framework.
In this paper, we present AutoCAD, a fully automatic and task-agnostic CAD generation framework.
arXiv Detail & Related papers (2022-11-29T13:39:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.