Related papers: ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla

URL: http://arxiv.org/abs/2410.14991v1
Date: Sat, 19 Oct 2024 05:45:21 GMT
Title: ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Authors: Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Farhan Ishmam, Fabiha Haider, Fariha Tanjim Shifat, Md Fahim, Md Farhad Alam
Abstract summary: We introduce a large-scale Bangla VQA dataset titled ChitroJera, totaling over 15k samples. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of Vision-Language tasks in Bangla.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual Question Answering (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA due to the lack of a proper benchmark dataset. The absence of such datasets challenges models that are known to be performant in other languages. Furthermore, existing Bangla VQA datasets offer little cultural relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset titled ChitroJera, totaling over 15k samples drawn from diverse and locally relevant data sources. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pre-trained dual-encoders outperform other models of their scale. We also evaluate the performance of large language models (LLMs) using prompt-based techniques, with LLMs achieving the best performance. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of Vision-Language tasks in Bangla.

Related papers

Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning [26.89241254462218]
We introduce Vision-Caption aware Supervised Fine-Tuning (VCASFT), a learning paradigm designed to enhance the performance of smaller Vision Language Models (VLMs). We benchmark it on ScienceQA, which consists of questions across diverse languages, subjects, and fields. To further demonstrate the effectiveness of this technique on low-resource languages, we developed HiSciVQA, a dataset comprising 2,245 high-quality, hand-annotated Hindi multimodal Q&A pairs.
arXiv Detail & Related papers (2025-09-20T11:07:36Z)
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains. BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [26.13077589552484]
Indic-QA is the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families. We generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance. We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages.
arXiv Detail & Related papers (2024-07-18T13:57:16Z)
Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models [36.56689822791777]
Knowledge-Based Visual Question Answering (KBVQA) advances this concept by adding external knowledge along with images to respond to questions. Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction method. Our model, enriched with knowledge, demonstrates an average improvement of 4.75% in Exact Match Score over the state-of-the-art on three different KBVQA datasets.
arXiv Detail & Related papers (2024-06-14T13:07:46Z)
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures. CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions. We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [57.30218240464696]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [40.972648044298374]
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks. They often lack interpretability and struggle with complex visual inputs. We introduce the large-scale Visual CoT dataset comprising 438k question-answer pairs. We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts.
arXiv Detail & Related papers (2024-03-25T17:59:23Z)
ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese. We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. We present PhoVIT, a comprehensive multimodal fusion that identifies objects in images based on questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG). MuRAG accesses an external non-parametric multimodal memory to augment language generation. Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language. We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance. We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web, in 21 different languages. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z)
xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically diverse languages. We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering [8.558954185502012]
We propose a method to improve cross-lingual question answering performance without requiring additional annotated data. We report a new state-of-the-art on four multilingual datasets: MLQA, XQuAD, SQuAD-it and PIAF (fr).
arXiv Detail & Related papers (2020-10-23T20:09:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.