Related papers: From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design

URL: http://arxiv.org/abs/2311.12668v1
Date: Tue, 21 Nov 2023 15:20:48 GMT
Title: From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design
Authors: Cyril Picard, Kristen M. Edwards, Anna C. Doris, Brandon Man, Giorgio Giannone, Md Ferdous Alam, and Faez Ahmed
Abstract summary: This paper presents a comprehensive evaluation of GPT-4V, a vision language model, across a wide spectrum of engineering design tasks. Our study assesses GPT-4V's capabilities in design tasks such as sketch similarity analysis, concept selection using Pugh Charts, material selection, engineering drawing analysis, CAD generation, topology optimization, design for additive and subtractive manufacturing, spatial reasoning challenges, and textbook problems.
Score: 5.268919870502001
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Engineering Design is undergoing a transformative shift with the advent of AI, marking a new era in how we approach product, system, and service planning. Large language models have demonstrated impressive capabilities in enabling this shift. Yet, with text as their only input modality, they cannot leverage the large body of visual artifacts that engineers have used for centuries and are accustomed to. This gap is addressed with the release of multimodal vision language models, such as GPT-4V, enabling AI to impact many more types of tasks. In light of these advancements, this paper presents a comprehensive evaluation of GPT-4V, a vision language model, across a wide spectrum of engineering design tasks, categorized into four main areas: Conceptual Design, System-Level and Detailed Design, Manufacturing and Inspection, and Engineering Education Tasks. Our study assesses GPT-4V's capabilities in design tasks such as sketch similarity analysis, concept selection using Pugh Charts, material selection, engineering drawing analysis, CAD generation, topology optimization, design for additive and subtractive manufacturing, spatial reasoning challenges, and textbook problems. Through this structured evaluation, we not only explore GPT-4V's proficiency in handling complex design and manufacturing challenges but also identify its limitations in complex engineering design applications. Our research establishes a foundation for future assessments of vision language models, emphasizing their immense potential for innovating and enhancing the engineering design and manufacturing landscape. It also contributes a set of benchmark testing datasets, with more than 1000 queries, for ongoing advancements and applications in this field.

Related papers

Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z)
From Idea to CAD: A Language Model-Driven Multi-Agent System for Collaborative Design [0.06749750044497731]
We present an approach that mirrors this team structure with a Vision Language Model (VLM)-based Multi Agent System. A model is generated automatically from sketches and/or textual descriptions. The resulting model can be refined collaboratively in an iterative validation loop with the user.
arXiv Detail & Related papers (2025-03-06T13:21:27Z)
Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks [20.93006455952299]
Vision-language-action (VLA) models represent a promising direction for developing general-purpose robotic systems. We present a comprehensive evaluation framework and benchmark suite for assessing VLA models.
arXiv Detail & Related papers (2024-11-04T18:01:34Z)
What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks. We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation [3.2169312784098705]
This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. DesignQA uniquely combines multimodal data (including textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition.
arXiv Detail & Related papers (2024-04-11T16:59:54Z)
Geometric Deep Learning for Computer-Aided Design: A Survey [85.79012726689511]
This survey offers a comprehensive overview of learning-based methods in computer-aided design. It includes similarity analysis and retrieval, 2D and 3D CAD model synthesis, and CAD generation from point clouds. It provides a complete list of benchmark datasets and their characteristics, along with open-source codes that have propelled research in this domain.
arXiv Detail & Related papers (2024-02-27T17:11:35Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey [59.95153883166705]
Traditional computer vision generally solves each single task independently by a dedicated model with the task instruction implicitly designed in the model architecture. Visual Instruction Tuning (VIT) has been intensively studied recently, which finetunes a large vision model with language as task instructions. This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
arXiv Detail & Related papers (2023-12-27T14:54:37Z)
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) [121.42924593374127]
We analyze the latest model, GPT-4V, to deepen the understanding of LMMs. GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs makes it a powerful multimodal generalist system. GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods.
arXiv Detail & Related papers (2023-09-29T17:34:51Z)
How Can Large Language Models Help Humans in Design and Manufacturing? [28.28959612862582]
Large Language Models (LLMs), including GPT-4, provide exciting new opportunities for generative design. We scrutinize the utility of LLMs in tasks such as: converting a text-based prompt into a design specification, transforming a design into manufacturing instructions, producing a design space and design variations, computing the performance of a design, and searching for designs predicated on performance. By exposing these limitations, we aspire to catalyze the continued improvement and progression of these models.
arXiv Detail & Related papers (2023-07-25T17:30:38Z)
Challenges and Practices of Deep Learning Model Reengineering: A Case Study on Computer Vision [3.510650664260664]
Many engineering organizations are reimplementing and extending deep neural networks from the research community. Deep learning model reengineering is challenging for reasons including under-documented reference models, changing requirements, and the cost of implementation and testing. Our study examines reengineering activities from a "process" view and focuses on engineers specifically engaged in the reengineering process.
arXiv Detail & Related papers (2023-03-13T21:23:43Z)
Design Space Exploration and Explanation via Conditional Variational Autoencoders in Meta-model-based Conceptual Design of Pedestrian Bridges [52.77024349608834]
This paper provides a performance-driven design exploration framework to augment the human designer through a Conditional Variational Autoencoder (CVAE). The CVAE is trained on 18'000 synthetically generated instances of a pedestrian bridge in Switzerland.
arXiv Detail & Related papers (2022-11-29T17:28:31Z)
Engineering AI Systems: A Research Agenda [9.84673609667263]
We provide a conceptualization of the typical evolution patterns that companies experience when employing machine learning. The main contribution of the paper is a research agenda for AI engineering that provides an overview of the key engineering challenges surrounding ML solutions.
arXiv Detail & Related papers (2020-01-16T20:29:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.