Fugu-MT Paper Translation (Abstract): EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

Paper overview: EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

arxiv url: http://arxiv.org/abs/2510.06371v1
Date: Tue, 07 Oct 2025 18:37:32 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.148148
Title: EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA
Title (reference translation): EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Visual QA
Authors: Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling
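Abstract summary: We introduce Everyday Multimodal and Multilingual QA (EverydayMMQA). OASIS is a multimodal dataset that integrates speech, images, and text. We benchmarked four closed-source models, three open-source models, and one fine-tuned model.
Reference score (independently computed attention score): 22.30611382189773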
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally-grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With over ~0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties, 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
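Abstract (reference translation): Large-scale multimodal models deliver strong results on tasks such as Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset that integrates speech, images, and text. With more than 0.92M images and 14.8M QA pairs, OASIS includes 3.7M spoken questions and supports four distinct input combinations: speech-only, text-only, speech+image, and text+image. Focusing on English and Arabic varieties across 18 countries, the dataset content is curated to reflect diverse real-world situations. OASIS tests models on tasks beyond object recognition, involving pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. Together, EverydayMMQA and OASIS provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.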
