SceMQA: A Scientific College Entrance Level Multimodal Question
Answering Benchmark
- URL: http://arxiv.org/abs/2402.05138v1
- Date: Tue, 6 Feb 2024 19:16:55 GMT
- Title: SceMQA: A Scientific College Entrance Level Multimodal Question
Answering Benchmark
- Authors: Zhenwen Liang, Kehan Guo, Gang Liu, Taicheng Guo, Yujun Zhou, Tianyu
Yang, Jiajun Jiao, Renjie Pi, Jipeng Zhang, Xiangliang Zhang
- Abstract summary: The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level.
SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology.
It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models' abilities.
- Score: 42.91902601376494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The paper introduces SceMQA, a novel benchmark for scientific multimodal
question answering at the college entrance level. It addresses a critical
educational phase often overlooked in existing benchmarks, spanning high school
to pre-college levels. SceMQA focuses on core science subjects including
Mathematics, Physics, Chemistry, and Biology. It features a blend of
multiple-choice and free-response formats, ensuring a comprehensive evaluation
of AI models' abilities. Additionally, our benchmark provides specific
knowledge points for each problem and detailed explanations for each answer.
SceMQA also uniquely presents problems with identical contexts but varied
questions to facilitate a more thorough and accurate assessment of reasoning
capabilities. In the experiment, we evaluate both open-source and closed-source
state-of-the-art Multimodal Large Language Models (MLLMs) across various
experimental settings. The results show that further research and development
are needed to build more capable MLLMs, as the strongest models achieve only
50% to 60% accuracy. Our benchmark and analysis will be available at
https://scemqa.github.io/.
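As a rough illustration of how a benchmark with this structure might be consumed, the sketch below assumes a hypothetical JSON layout in which each problem carries an id, subject, knowledge point, question type, gold answer, and explanation, and it scores multiple-choice predictions by exact match. The field names, file name, and grading rule are assumptions for illustration only; the paper does not specify a release format, and the free-response portion would need rubric-based or model-assisted grading rather than exact match.

```python
# Hypothetical sketch: scoring model predictions against SceMQA-style items.
# The field names (id, subject, question_type, answer) and the file name are
# assumptions for illustration, not the benchmark's published schema.
import json
from collections import defaultdict

def load_items(path):
    """Load benchmark items from a JSON file (assumed to be a list of dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def score(items, predictions):
    """Compute per-subject accuracy with exact-match grading.
    `predictions` maps item id -> model answer string."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        subject = item["subject"]                      # e.g. "Mathematics", "Physics"
        gold = item["answer"].strip().lower()
        pred = predictions.get(item["id"], "").strip().lower()
        total[subject] += 1
        # Exact match is only a reasonable proxy for multiple-choice questions;
        # free-response answers would require rubric- or model-assisted grading.
        correct[subject] += int(pred == gold)
    return {subject: correct[subject] / total[subject] for subject in total}

if __name__ == "__main__":
    items = load_items("scemqa_items.json")            # hypothetical file name
    preds = {item["id"]: "A" for item in items}        # placeholder predictions
    print(score(items, preds))
```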
Related papers
- CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate LLMs on complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning [32.811840681428464]
Multi-modal large language models (MLLMs) have demonstrated promising capabilities across various tasks.
We present a detailed evaluation of the performance of 25 representative MLLMs in scientific reasoning.
The best observed results include a 53.4% accuracy in mathematics by Claude3.5-Sonnet, 38.2% in physics by GPT-4o, and 47.0% in chemistry by Gemini-1.5-Pro.
arXiv Detail & Related papers (2024-09-10T01:20:26Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning [16.032320995230734]
We introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese.
CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school.
We propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions.
arXiv Detail & Related papers (2024-01-25T08:22:10Z)
- RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification [70.9087014537896]
Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars.
To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets.
arXiv Detail & Related papers (2023-10-14T14:52:37Z)
- AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models [3.518832148294879]
How to evaluate the question-solving abilities of large language models like ChatGPT is a prominent yet challenging issue.
We propose AGIBench -- a multi-granularity, multimodal, human-referenced, and auto-scoring benchmarking methodology for LLMs.
arXiv Detail & Related papers (2023-09-05T13:43:37Z)
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models [76.88692952308084]
M3Exam is a benchmark for evaluating large language models (LLMs) in a multilingual, multimodal, and multilevel context.
M3Exam contains 12,317 questions in 9 diverse languages across three educational levels.
We assess the performance of top-performing LLMs on M3Exam and find that current models, including GPT-4, still struggle with multilingual text.
arXiv Detail & Related papers (2023-06-08T13:21:29Z)
- M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models [35.17226595231825]
M3KE is a Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark.
It is developed to measure knowledge acquired by Chinese large language models.
We have collected 20,477 questions from 71 tasks.
arXiv Detail & Related papers (2023-05-17T14:56:31Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought explanations improve question answering performance on SQA by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.