GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models
- URL: http://arxiv.org/abs/2508.03737v1
- Date: Thu, 31 Jul 2025 18:24:05 GMT
- Title: GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models
- Authors: Ashutosh Bandooni, Brindha Subburaj
- Abstract summary: GanitBench is a benchmark consisting of 1527 vision-only questions covering several topics in Mathematics. We evaluate two closed-source models in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarks for evaluating reasoning in Vision Language Models (VLMs) across many fields and domains have been curated with increasing frequency over the last few years. However, these are often monolingual, mostly available in English, and there is a shortage of Hindi datasets for tasks beyond comprehension and translation. We introduce GanitBench, a challenging benchmark of 1527 vision-only questions covering several topics in Mathematics, available in both English and Hindi. Collected from two major Indian examinations, JEE Advanced and the CBSE Board examinations, the benchmark consists of questions presented as images containing the question text along with any figures essential to it. We evaluate two closed-source models in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models under a "Double Lock" constraint, which reduces their performance by considerable margins. We observe that two-shot CoT appears to be the more effective setting in this environment. Performance of the two VLMs also decreases when the same questions are asked in Hindi. We hope our work facilitates the inclusion of languages like Hindi in research.
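The evaluation protocol described in the abstract (zero-shot vs. two-shot CoT over image-only questions, scored by answer accuracy) can be sketched as a simple harness. The following is a minimal illustration, not the authors' code: it assumes the OpenAI Python SDK, PNG question images paired with a gold answer key, and plain-text worked solutions as the two-shot exemplars. The "Double Lock" constraint is not reproduced here, since the abstract does not define it.

```python
# Minimal sketch of a zero-shot vs. two-shot CoT evaluation loop for
# vision-only math questions, in the spirit of the GanitBench setup.
# Assumptions (not from the paper): OpenAI Python SDK, PNG images,
# and a dataset of (image_path, gold_answer) pairs.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_INSTRUCTION = (
    "Solve the mathematics problem shown in the image. "
    "Think step by step, then state only the final answer on the last line."
)

def encode_image(path: str) -> str:
    """Read a question image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{data}"

def ask(image_path: str, exemplars: list[str] | None = None) -> str:
    """Query the VLM in zero-shot (exemplars=None) or two-shot CoT mode."""
    content = []
    # Two-shot CoT: prepend worked example solutions as plain text.
    for ex in exemplars or []:
        content.append({"type": "text", "text": ex})
    content.append({"type": "text", "text": COT_INSTRUCTION})
    content.append(
        {"type": "image_url", "image_url": {"url": encode_image(image_path)}}
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    return resp.choices[0].message.content

def accuracy(dataset: list[tuple[str, str]], exemplars=None) -> float:
    """dataset: (image_path, gold_answer) pairs; naive last-line answer matching."""
    correct = 0
    for image_path, gold in dataset:
        reply = ask(image_path, exemplars)
        correct += reply.strip().splitlines()[-1].strip() == gold
    return correct / len(dataset)
```

Under this sketch, the paper's two settings differ only in whether exemplars are supplied, and per-language average accuracy would come from running `accuracy` over the English and Hindi splits separately.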
Related papers
- HinTel-AlignBench: A Framework and Benchmark for Hindi-Telugu with English-Aligned Samples [3.3715057550177145]
We present a scalable framework to evaluate Vision-Language Models (VLMs) in Indian languages and compare their performance with English. Using the framework, we generate HinTel-AlignBench, a benchmark that draws from diverse sources in Hindi and Telugu with English-aligned samples. We find a performance regression in Indian languages relative to English for 4 out of 5 tasks across all models, with an average drop of 8.3 points in Hindi and 5.5 points in Telugu.
arXiv Detail & Related papers (2025-11-19T07:11:00Z) - mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models [2.0467354053171243]
We introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025). Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters.
arXiv Detail & Related papers (2025-11-12T13:52:37Z) - All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark [74.4821011648997]
MAIA is a benchmark for fine-grained investigation of the reasoning abilities of visual language models on videos. It considers twelve categories that aim to disentangle language and vision relations by highlighting the role of the visual input. MAIA differs from other available video benchmarks in its design, its reasoning categories, the metric it uses, and the language and culture of the videos.
arXiv Detail & Related papers (2025-02-24T09:25:51Z) - HindiLLM: Large Language Model for Hindi [0.09363323206192666]
We pre-train two autoregressive Large Language Models (LLMs) for the Hindi language. We use a two-step process comprising unsupervised pre-training and supervised fine-tuning. The evaluation shows that the HindiLLM-based fine-tuned models outperform several models on most language-related tasks.
arXiv Detail & Related papers (2024-12-29T05:28:15Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - Language Models are Multilingual Chain-of-Thought Reasoners [83.37148309771378]
We introduce the Multilingual Grade School Math (MGSM) benchmark by manually translating 250 grade-school math problems into ten typologically diverse languages.
We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale.
We show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment.
arXiv Detail & Related papers (2022-10-06T17:03:34Z) - IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian Languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z) - cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation [6.381149074212897]
We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language.
We release large-scale visual question answering datasets in Japanese and Hindi.
Our pipeline outperforms the current state-of-the-art models by relative increases in accuracy of 4.4% and 13.4%, respectively.
arXiv Detail & Related papers (2022-06-07T14:46:30Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)