Fugu-MT Paper Translation (Abstract): Research on Vision-Language Question Answering Models for Industrial Robots

Paper summary: Research on Vision-Language Question Answering Models for Industrial Robots

arxiv url: http://arxiv.org/abs/2605.01483v1
Date: Sat, 02 May 2026 15:11:48 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.796812
Title: Research on Vision-Language Question Answering Models for Industrial Robots
Authors: Ping Li, Bartlomiej Brzozka
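Abstract summary: A hierarchical cross-modal fusion model is proposed for vision-language question answering (VLQA) in industrial robotics. The framework integrates advanced object detection, multi-scale visual encoding, syntactic parsing, and task-aware semantic attention to unite vision and language signals into a joint reasoning space.
Reference score (independently computed attention score): 6.470944338393257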
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: A hierarchical cross-modal fusion model is proposed for vision-language question answering (VLQA) in industrial robotics, targeting the challenges of semantic ambiguity, complex environmental layouts, and domain-specific terminology common in modern manufacturing. The framework integrates advanced object detection, multi-scale visual encoding, syntactic parsing, and task-aware semantic attention to unite vision and language signals into a joint reasoning space. Region-based deep networks extract visual features, weighted embeddings aggregate them, and recurrent neural parsing encodes sentence structure. Through fine-grained semantic alignment driven by adaptive fusion and cross-attention mechanisms, the system can handle operational queries, instruction steps, and anomaly detection with higher reliability. Compared with existing VLQA benchmarks, validation experiments on the IVQA and RIF benchmarks indicate improvements in semantic alignment, Top-1 accuracy, and robustness to ambiguous or procedural task queries. Ablation studies further quantify the impact of each architectural module, confirming the necessity of multi-level feature integration and context-driven gating for dependable industrial deployment. The technical advances reported here provide core methodologies for improving the interpretability and operational effectiveness of industrial robots facing diverse human-robot interaction tasks.
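Note: The abstract names cross-attention and context-driven gating but gives no equations, so the following is only a generic illustration of those two ideas, not the authors' architecture. It is a minimal pure-Python sketch, assuming scaled dot-product attention from word vectors to image-region vectors and a scalar sigmoid gate that blends each word vector with its attended visual vector. All function names, the gate parameterization, and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Here queries stand in for word vectors and keys/values for region vectors.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted sum of the value vectors, one output vector per query.
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return attended


def gated_fusion(text_vec, visual_vec, gate_weights, gate_bias=0.0):
    """Blend two same-size vectors with a scalar sigmoid gate.

    The gate is computed from the concatenation [text_vec, visual_vec];
    this parameterization is an assumption made for illustration.
    """
    z = dot(gate_weights, text_vec + visual_vec) + gate_bias
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * t + (1.0 - g) * v for t, v in zip(text_vec, visual_vec)]


# Toy inputs (made up): two word vectors and three region vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended = cross_attention(words, regions, regions)
# Zero gate weights give g = 0.5, i.e. a plain average of the two modalities.
fused = [gated_fusion(w, a, gate_weights=[0.0] * 4) for w, a in zip(words, attended)]
```

In a trained model the gate weights would be learned, so the blend could lean toward language or vision depending on the query. The zero-weight gate here only makes the behavior easy to check by hand.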

Related papers