Related papers: DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation

URL: http://arxiv.org/abs/2404.07917v2
Date: Fri, 23 Aug 2024 17:19:18 GMT
Title: DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation
Authors: Anna C. Doris, Daniele Grandi, Ryan Tomich, Md Ferdous Alam, Mohammadmehdi Ataei, Hyunmin Cheong, Faez Ahmed
Abstract summary: This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. DesignQA uniquely combines multimodal data (including textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition.
Score: 3.2169312784098705
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (including textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models (at the time of writing) like GPT-4o, GPT-4, Claude-Opus, Gemini-1.0, and LLaVA-1.5 against the benchmark, and our study uncovers the existing gaps in MLLMs' abilities to interpret complex engineering documentation. The MLLMs tested, while promising, struggle to reliably retrieve relevant rules from the Formula SAE documentation, face challenges in recognizing technical components in CAD images, and encounter difficulty in analyzing engineering drawings. These findings underscore the need for multimodal models that can better handle the multifaceted questions characteristic of design according to technical documentation. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: https://github.com/anniedoris/design_qa/.

Related papers

The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own [1.2891210250935148]
This paper presents a comprehensive investigation into the capability of Large Language Models (LLMs) to successfully complete a control systems course. We assess LLM performance using ChatGPT under a "minimal effort" protocol that simulates realistic student usage patterns. Our analysis provides quantitative insights into AI's strengths and limitations in handling mathematical formulations, coding challenges, and theoretical concepts in control systems engineering.
arXiv Detail & Related papers (2025-02-23T18:47:14Z)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA). Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents. We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report [0.0]
Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. They require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. We present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs.
arXiv Detail & Related papers (2024-06-17T10:45:47Z)
Automated User Story Generation with Test Case Specification Using Large Language Model [0.0]
We developed a tool, "GeneUS", to automatically create user stories from requirements documents. The output is provided in a format that leaves open the possibility of downstream integration with popular project management tools.
arXiv Detail & Related papers (2024-04-02T01:45:57Z)
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code, the first real-world benchmark for this task. We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics. Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
arXiv Detail & Related papers (2024-03-05T17:56:27Z)
TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the amazing power of large language models (LLMs) to solve our task. We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
LLM4EDA: Emerging Progress in Large Language Models for Electronic Design Automation [74.7163199054881]
Large Language Models (LLMs) have demonstrated their capability in context understanding, logic reasoning and answer generation. We present a systematic study on the application of LLMs in the EDA field. We highlight the future research direction, focusing on applying LLMs in logic synthesis, physical design, multi-modal feature extraction and alignment of circuits.
arXiv Detail & Related papers (2023-12-28T15:09:14Z)
From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design [5.268919870502001]
This paper presents a comprehensive evaluation of vision-language models (VLMs) across a spectrum of engineering design tasks. Specifically in this paper, we assess the capabilities of two VLMs, GPT-4V and LLaVA 1.6 34B, in design tasks such as sketch similarity analysis, CAD generation, topology optimization, manufacturability assessment, and engineering textbook problems.
arXiv Detail & Related papers (2023-11-21T15:20:48Z)
JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving [77.51817534090789]
We propose JiuZhang 2.0, a unified Chinese PLM specially for multi-task mathematical problem solving. Our idea is to maintain a moderate-sized model and employ the cross-task knowledge sharing to improve the model capacity in a multi-task setting.
arXiv Detail & Related papers (2023-06-19T15:45:36Z)
Natural Language Processing for Systems Engineering: Automatic Generation of Systems Modelling Language Diagrams [0.10312968200748115]
An approach is proposed to assist systems engineers in the automatic generation of systems diagrams from unstructured natural language text. The intention is to provide users with a more standardised, comprehensive and automated starting point from which they can subsequently refine and adapt the diagrams according to their needs.
arXiv Detail & Related papers (2022-08-09T19:20:33Z)
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language. We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
Engineering AI Systems: A Research Agenda [9.84673609667263]
We provide a conceptualization of the typical evolution patterns that companies experience when employing machine learning. The main contribution of the paper is a research agenda for AI engineering that provides an overview of the key engineering challenges surrounding ML solutions.
arXiv Detail & Related papers (2020-01-16T20:29:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.