ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
- URL: http://arxiv.org/abs/2509.02949v1
- Date: Wed, 03 Sep 2025 02:26:48 GMT
- Title: ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
- Authors: Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura
- Abstract summary: We propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals.
- Score: 13.040491675077687
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.
Related papers
- AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation [31.02336903452371]
AirQA is a human-annotated comprehensive paper QA dataset in the field of artificial intelligence (AI). With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
arXiv Detail & Related papers (2025-09-21T07:24:17Z)
- Towards Building General Purpose Embedding Models for Industry 4.0 Agents [5.212780106286918]
We focus on improving language models' understanding for asset maintenance to guide the engineer's decisions and minimize asset downtime. Given a set of tasks expressed in natural language for the Industry 4.0 domain, each associated with queries related to a specific asset, we want to recommend relevant items and generalize queries of similar assets. Our approach begins with gathering a qualitative, expert-vetted knowledge base to construct nine asset-specific task datasets.
arXiv Detail & Related papers (2025-06-14T19:02:07Z)
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
- LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data. We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
- EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments [11.97783742296183]
Embodied Mobile Manipulation in Open Environments is a benchmark that requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. Embodied Mobile Manipulation in Open Environments seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment. We design a sophisticated agent system that consists of an LLM with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error detection mechanisms.
arXiv Detail & Related papers (2025-03-11T16:42:36Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding [9.921932789361732]
We present a novel evaluation dataset, ProMQA, to measure system advancements in application-oriented scenarios.
ProMQA consists of 401 multimodal procedural QA pairs on user recordings of procedural activities coupled with their corresponding instructions.
arXiv Detail & Related papers (2024-10-29T16:39:28Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
- Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models [68.37431984231338]
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision.
We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting.
arXiv Detail & Related papers (2022-12-15T18:45:29Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.