Fugu-MT Paper Translation (Abstract): MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

Paper overview: MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

arxiv url: http://arxiv.org/abs/2604.09552v1
Date: Sat, 31 Jan 2026 03:09:47 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.486071
Title: MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval
Title (reference translation): MCERF: Improving Multimodal LLM Evaluation through Enhanced Retrieval
Authors: Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu
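Abstract summary: Engineering rulebooks and technical standards contain multimodal information such as dense text, tables, and illustrations. This work establishes the Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning. It shows how vision language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.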
Reference score (independently computed attention score): 7.964714175107759
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Engineering rulebooks and technical standards contain multimodal information, such as dense text, tables, and illustrations, that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes the Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure- and table-guided queries, (iii) High Reasoning LLM mode for complex multimodal questions, and (iv) SelfConsistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. Furthermore, this work establishes and compares two routing approaches, a single case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that the system improves average accuracy across all tasks, with a relative gain of +41.1% over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks achieved without complete rulebook ingestion. This shows how vision language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
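Abstract (reference translation): Engineering rulebooks and technical standards contain multimodal information such as dense text, tables, and illustrations, which is difficult for retrieval-augmented generation (RAG) systems. Building on the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes MCERF (Multimodal ColPali Enhanced Retrieval and Reasoning Framework), which couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure- and table-guided queries, (iii) High Reasoning LLM mode for complex multimodal questions, and (iv) SelfConsistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. In addition, two routing approaches are established and compared: a single case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. According to evaluation on the DesignQA benchmark, the system achieves a relative gain of 41.1% over the best baseline RAG results, improving average accuracy across all tasks. This shows how vision language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.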
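
The abstract describes routing each query to one of several pipelines and stabilizing answers with a self-consistency decision, but gives no implementation detail. The following is only a minimal sketch of those two ideas under stated assumptions: the mode names, the keyword heuristics, the rule-number pattern, and the functions `route_query` and `self_consistent_answer` are all hypothetical and are not taken from the paper, whose routers presumably rely on an LLM or agents rather than keywords.

```python
import re
from collections import Counter
from typing import Callable

# Hypothetical mode names mirroring three strategies named in the abstract.
HYBRID_LOOKUP = "hybrid_lookup"
VISION_TO_TEXT = "vision_to_text"
HIGH_REASONING = "high_reasoning"

# Assumed format for an explicit rule mention, e.g. "rule V.1.2" (not from the paper).
RULE_PATTERN = re.compile(r"\brule\s+[A-Z]{1,3}\.\d+(\.\d+)*", re.IGNORECASE)
# Assumed keywords suggesting that a query depends on a figure or table.
VISUAL_HINTS = ("figure", "table", "drawing", "image")


def route_query(query: str) -> str:
    """Pick a pipeline with simple keyword heuristics (an illustrative stand-in)."""
    if RULE_PATTERN.search(query):
        return HYBRID_LOOKUP
    lowered = query.lower()
    if any(hint in lowered for hint in VISUAL_HINTS):
        return VISION_TO_TEXT
    # Everything else falls through to the most expensive reasoning mode.
    return HIGH_REASONING


def self_consistent_answer(sample: Callable[[], str], n_samples: int = 5) -> str:
    """Draw several answers and return the most frequent one (majority vote)."""
    votes = Counter(sample().strip().lower() for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

In this sketch `sample` stands for any callable that returns one model answer per call; normalizing answers before voting keeps "Yes" and "yes " from splitting the vote.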


Related papers