Related papers: DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation

URL: http://arxiv.org/abs/2404.07917v1
Date: Thu, 11 Apr 2024 16:59:54 GMT
Title: DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation
Authors: Anna C. Doris, Daniele Grandi, Ryan Tomich, Md Ferdous Alam, Hyunmin Cheong, Faez Ahmed
Abstract summary: This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation.
Score: 3.3554851717552387
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This research introduces DesignQA, a novel benchmark aimed at evaluating the proficiency of multimodal large language models (MLLMs) in comprehending and applying engineering requirements in technical documentation. Developed with a focus on real-world engineering challenges, DesignQA uniquely combines multimodal data (including textual design requirements, CAD images, and engineering drawings) derived from the Formula SAE student competition. Different from many existing MLLM benchmarks, DesignQA contains document-grounded visual questions where the input image and input document come from different sources. The benchmark features automatic evaluation metrics and is divided into segments (Rule Comprehension, Rule Compliance, and Rule Extraction) based on tasks that engineers perform when designing according to requirements. We evaluate state-of-the-art models like GPT4 and LLaVA against the benchmark, and our study uncovers the existing gaps in MLLMs' abilities to interpret complex engineering documentation. Key findings suggest that while MLLMs demonstrate potential in navigating technical documents, substantial limitations exist, particularly in accurately extracting and applying detailed requirements to engineering designs. This benchmark sets a foundation for future advancements in AI-supported engineering design processes. DesignQA is publicly available at: https://github.com/anniedoris/design_qa/.

Related papers

Exploring the Use of LLMs for Requirements Specification in an IT Consulting Company [0.39563752273706504]
This paper reports our experience using large language models (LLMs) to automate the requirements specification process. We show that LLMs can help automate and standardize the requirements specification, reducing time and human effort. However, the quality of LLM-generated FDS highly depends on inputs and often requires human revision.
arXiv Detail & Related papers (2025-07-25T09:49:37Z)
Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset to evaluate multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning across seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z)
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs [25.945493464645548]
Multimodal large language models (MLLMs) present promising opportunities for automation and enhancement in Electronic Design Automation (EDA). We introduce MMCircuitEval, the first multimodal benchmark specifically designed to assess MLLM performance across diverse EDA tasks. MMCircuitEval comprises 3614 meticulously curated question-answer (QA) pairs spanning digital and analog circuits across critical EDA stages.
arXiv Detail & Related papers (2025-07-20T05:46:32Z)
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation [31.237236649603123]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering. DesignBench is a benchmark for assessing MLLMs' capabilities in automated front-end engineering.
arXiv Detail & Related papers (2025-06-06T17:21:21Z)
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents [17.296425855109426]
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents. TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models.
arXiv Detail & Related papers (2025-05-19T16:11:23Z)
The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own [1.2891210250935148]
This paper presents a comprehensive investigation into the capability of Large Language Models (LLMs) to successfully complete a control systems course. We assess LLM performance using ChatGPT under a "minimal effort" protocol that simulates realistic student usage patterns. Our analysis provides quantitative insights into AI's strengths and limitations in handling mathematical formulations, coding challenges, and theoretical concepts in control systems engineering.
arXiv Detail & Related papers (2025-02-23T18:47:14Z)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA). Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents. We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report [0.0]
Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. They require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. We present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs.
arXiv Detail & Related papers (2024-06-17T10:45:47Z)
Automated User Story Generation with Test Case Specification Using Large Language Model [0.0]
We developed a tool, "GeneUS", to automatically create user stories from requirements documents. The output is provided in a format that leaves the possibilities open for downstream integration with popular project management tools.
arXiv Detail & Related papers (2024-04-02T01:45:57Z)
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering [74.99736967448423]
We construct Design2Code - the first real-world benchmark for this task. We manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics. Our fine-grained break-down metrics indicate that models mostly lag in recalling visual elements from the input webpages and generating correct layout designs.
arXiv Detail & Related papers (2024-03-05T17:56:27Z)
TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve our task. We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
LLM4EDA: Emerging Progress in Large Language Models for Electronic Design Automation [74.7163199054881]
Large Language Models (LLMs) have demonstrated their capability in context understanding, logic reasoning and answer generation. We present a systematic study on the application of LLMs in the EDA field. We highlight the future research direction, focusing on applying LLMs in logic synthesis, physical design, multi-modal feature extraction and alignment of circuits.
arXiv Detail & Related papers (2023-12-28T15:09:14Z)
From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design [5.268919870502001]
This paper presents a comprehensive evaluation of vision-language models (VLMs) across a spectrum of engineering design tasks. Specifically in this paper, we assess the capabilities of two VLMs, GPT-4V and LLaVA 1.6 34B, in design tasks such as sketch similarity analysis, CAD generation, topology optimization, manufacturability assessment, and engineering textbook problems.
arXiv Detail & Related papers (2023-11-21T15:20:48Z)
JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving [77.51817534090789]
We propose JiuZhang2.0, a unified Chinese PLM specially for multi-task mathematical problem solving. Our idea is to maintain a moderate-sized model and employ cross-task knowledge sharing to improve the model capacity in a multi-task setting.
arXiv Detail & Related papers (2023-06-19T15:45:36Z)
Natural Language Processing for Systems Engineering: Automatic Generation of Systems Modelling Language Diagrams [0.10312968200748115]
An approach is proposed to assist systems engineers in the automatic generation of systems diagrams from unstructured natural language text. The intention is to provide users with a more standardised, comprehensive and automated starting point from which they can subsequently refine and adapt the diagrams according to their needs.
arXiv Detail & Related papers (2022-08-09T19:20:33Z)
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language. We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
Engineering AI Systems: A Research Agenda [9.84673609667263]
We provide a conceptualization of the typical evolution patterns that companies experience when employing machine learning. The main contribution of the paper is a research agenda for AI engineering that provides an overview of the key engineering challenges surrounding ML solutions.
arXiv Detail & Related papers (2020-01-16T20:29:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.