EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA
- URL: http://arxiv.org/abs/2510.06371v1
- Date: Tue, 07 Oct 2025 18:37:32 GMT
- Title: EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA
- Authors: Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling
- Abstract summary: We introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). OASIS is a multimodal dataset integrating speech, images, and text. We benchmarked four closed-source models, three open-source models, and one fine-tuned model.
- Score: 22.30611382189773
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With approximately 0.92M images and 14.8M QA pairs, of which 3.7M are spoken questions, OASIS enables four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties across 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
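The four input combinations described in the abstract translate naturally into a small evaluation harness. Below is a minimal sketch of that idea; the record fields, `INPUT_MODES` table, and `build_inputs` helper are illustrative assumptions, not part of the OASIS release.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical OASIS-style record: field names are illustrative, not from the release.
@dataclass
class SVQAExample:
    question_text: Optional[str]   # written form of the question
    question_audio: Optional[str]  # path to the spoken question (e.g., a WAV file)
    image: Optional[str]           # path to the accompanying image, if any
    answer: str

# The four input combinations reported in the paper.
INPUT_MODES = {
    "speech_only":  ("question_audio", None),
    "text_only":    ("question_text", None),
    "speech_image": ("question_audio", "image"),
    "text_image":   ("question_text", "image"),
}

def build_inputs(example: SVQAExample, mode: str) -> dict:
    """Select which modalities are passed to the model for a given input mode."""
    query_field, image_field = INPUT_MODES[mode]
    return {
        "query": getattr(example, query_field),
        "image": getattr(example, image_field) if image_field else None,
    }
```

An evaluation loop would iterate over examples and modes, pass each `build_inputs` result to a model, and compare outputs against `answer`, which is one way the benchmarked closed-source, open-source, and fine-tuned models could be placed on a common footing.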
Related papers
- Multimodal Evaluation of Russian-language Architectures [88.00147763684451]
We introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities. Mera Multi provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages.
arXiv Detail & Related papers (2025-11-19T15:43:53Z)
- IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs [2.697578491761838]
IndicVisionBench is the first large-scale benchmark centered on the Indian subcontinent. Our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA). In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs.
arXiv Detail & Related papers (2025-11-06T18:01:22Z)
- TowerVision: Understanding and Improving Multilinguality in Vision-Language Models [56.775118098058506]
TowerVision is a family of open multilingual vision-language models for both image-text and video-text tasks. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches. To support further research, we publicly release all models, data, and training recipes.
arXiv Detail & Related papers (2025-10-22T17:02:48Z)
- VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
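Question-driven retrieval of this kind is typically scored by ranking candidate pages per question and measuring recall@k against annotated evidence. The snippet below is a generic illustration with made-up variable names, not the VisR-Bench evaluation code.

```python
def recall_at_k(ranked_pages, gold_pages, k=5):
    """Fraction of questions whose gold evidence page appears in the top-k ranking.

    ranked_pages: dict mapping question id -> list of page ids, best first
    gold_pages:   dict mapping question id -> set of relevant page ids
    (Both structures are illustrative; the benchmark's actual format may differ.)
    """
    hits = 0
    for qid, ranking in ranked_pages.items():
        if set(ranking[:k]) & gold_pages[qid]:
            hits += 1
    return hits / len(ranked_pages)
```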
- SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs [12.60449414234283]
SpokenNativQA is the first multilingual and culturally aligned spoken question-answering dataset. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages.
arXiv Detail & Related papers (2025-05-25T14:22:18Z)
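A common baseline for evaluating LLMs on spoken queries like these is a cascade that first transcribes the audio and then prompts a text LLM with the transcript. The sketch below shows that pattern in outline; the `transcribe` and `ask_llm` helpers are placeholders, not part of the SpokenNativQA release.

```python
def answer_spoken_question(audio_path, transcribe, ask_llm):
    """Cascaded spoken-QA baseline: ASR followed by a text-only LLM.

    transcribe: callable(audio_path) -> str, e.g. a wrapper around an ASR model
    ask_llm:    callable(prompt) -> str, e.g. a wrapper around a chat model
    Both callables are hypothetical stand-ins for whatever systems are benchmarked.
    """
    transcript = transcribe(audio_path)
    prompt = f"Answer the following question concisely:\n{transcript}"
    return ask_llm(prompt)
```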
- Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models [38.608158064184366]
We standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently.
arXiv Detail & Related papers (2025-01-02T03:28:52Z)
- NativQA: Multilingual Culturally-Aligned Natural Query for LLMs [12.35947908812959]
Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs). We propose a scalable, language-independent framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets. We demonstrate the efficacy of the proposed framework by designing a multilingual natural QA dataset, MultiNativQA, consisting of 64k manually annotated QA pairs in seven languages.
arXiv Detail & Related papers (2024-07-13T09:34:00Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
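Benchmarks spanning 30 countries and 31 languages are usually reported with accuracy broken down by group (language or country) rather than a single global number. The snippet below is a generic per-group exact-match accuracy sketch, not CVQA's official scorer, and the exact-match criterion is an assumption.

```python
from collections import defaultdict

def accuracy_by_group(predictions, references, groups):
    """Exact-match accuracy per group (e.g., per language or per country).

    predictions: list of model answers
    references:  list of gold answers, aligned with predictions
    groups:      list of group labels (e.g., language codes), same length
    The exact-match criterion is illustrative; the official metric may differ.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold, group in zip(predictions, references, groups):
        total[group] += 1
        if pred.strip().lower() == gold.strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}
```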
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
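Training on "arbitrarily interleaved text and images" implies a single sequence in which image placeholders can appear anywhere between text spans. The sketch below shows one plausible way to flatten such a document; the segment format and placeholder token are illustrative assumptions, not the Kosmos-1 implementation.

```python
def flatten_interleaved(document, image_token="<image>"):
    """Turn an interleaved document into one sequence of text and image slots.

    document: list of segments, each either {"text": "..."} or {"image": "path.jpg"}
    Returns the flattened string plus the ordered list of image paths, so a
    multimodal model can later substitute visual embeddings at each placeholder.
    (The segment format and placeholder token are assumptions for illustration.)
    """
    parts, images = [], []
    for segment in document:
        if "text" in segment:
            parts.append(segment["text"])
        else:
            parts.append(image_token)
            images.append(segment["image"])
    return " ".join(parts), images
```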
- MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z)
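Intent detection and slot labelling over the same utterance are often represented together, with several intents per example given the multi-intent property mentioned above. The record and metric below show one common representation; the field names and label set are hypothetical, not the MULTI3NLU++ schema.

```python
# A hypothetical multi-intent NLU example: one utterance can carry several
# intents, and slots are labelled as (value, slot-type) spans in the text.
example = {
    "utterance": "Cancel my 3pm appointment and book a new one for Friday morning",
    "intents": ["cancel_appointment", "book_appointment"],   # multi-intent
    "slots": [
        {"value": "3pm", "type": "time"},
        {"value": "Friday morning", "type": "date"},
    ],
}

def slot_f1(predicted, gold):
    """Span-level F1 over (value, type) pairs, a typical slot-labelling metric."""
    pred = {(s["value"], s["type"]) for s in predicted}
    ref = {(s["value"], s["type"]) for s in gold}
    if not pred or not ref:
        return 0.0
    precision = len(pred & ref) / len(pred)
    recall = len(pred & ref) / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```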
- Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
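Framing answers as generated text, rather than as classification over a fixed answer vocabulary, is what lets a unified model of this kind handle rare answers. A minimal sketch of that contrast is below; the prompt template and the `generate`/`classifier` callables are placeholders for illustration, not the paper's actual interface.

```python
def vqa_as_generation(image, question, generate):
    """Treat VQA as conditional text generation instead of answer classification.

    generate: callable(image, prompt) -> str, a placeholder for any
              vision-language model that emits free-form text.
    Because the answer is generated token by token, rare or unseen answers
    remain reachable, unlike with a fixed classification head.
    """
    prompt = f"vqa question: {question}"  # task-prefix style prompt (illustrative)
    return generate(image, prompt)

def vqa_as_classification(image_features, question_features, classifier, answer_vocab):
    """Contrast: a fixed answer vocabulary caps what the model can ever output."""
    scores = classifier(image_features, question_features)  # one score per known answer
    best = max(range(len(scores)), key=lambda i: scores[i])
    return answer_vocab[best]
```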
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.