DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts
- URL: http://arxiv.org/abs/2503.19498v4
- Date: Wed, 10 Sep 2025 02:18:09 GMT
- Title: DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts
- Authors: Yujing Lu, Ling Zhong, Jing Yang, Weiming Li, Peng Wei, Yongheng Wang, Manni Duan, Qing Zhang
- Abstract summary: Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning.
- Score: 24.157256695111872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chart Question Answering (CQA) evaluates Multimodal Large Language Models (MLLMs) on visual understanding and reasoning over chart data. However, existing benchmarks mostly test surface-level parsing, such as reading labels and legends, while overlooking deeper scientific reasoning. We propose DomainCQA, a framework for constructing domain-specific CQA benchmarks that emphasize both visual comprehension and knowledge-intensive reasoning. It integrates complexity-aware chart selection, multitier QA generation, and expert validation. Applied to astronomy, DomainCQA yields AstroChart, a benchmark of 1,690 QA pairs over 482 charts, exposing persistent weaknesses in fine-grained perception, numerical reasoning, and domain knowledge integration across 21 MLLMs. Fine-tuning on AstroChart improves performance across fundamental and advanced tasks. Pilot QA sets in biochemistry, economics, medicine, and social science further demonstrate DomainCQA's generality. Together, our results establish DomainCQA as a unified pipeline for constructing and augmenting domain-specific chart reasoning benchmarks.
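The abstract describes a three-stage construction pipeline: complexity-aware chart selection, multitier QA generation, and expert validation. The sketch below is a minimal, self-contained Python illustration of how such a pipeline could be wired together; every name and heuristic in it (Chart, score_complexity, generate_qa_tiers, expert_validate, the selection threshold) is a hypothetical placeholder, not the authors' implementation.

```python
# Minimal sketch of a DomainCQA-style benchmark construction pipeline.
# All names and heuristics below are hypothetical placeholders.
from dataclasses import dataclass


@dataclass
class Chart:
    chart_id: str
    n_series: int          # number of plotted series
    n_annotations: int     # axis labels, legends, error bars, etc.


@dataclass
class QAPair:
    chart_id: str
    tier: str              # "fundamental" or "advanced"
    question: str
    answer: str
    approved: bool = False


def score_complexity(chart: Chart) -> float:
    """Toy complexity score: richer charts score higher."""
    return chart.n_series + 0.5 * chart.n_annotations


def select_charts(charts: list[Chart], threshold: float) -> list[Chart]:
    """Complexity-aware selection: keep only charts above a threshold."""
    return [c for c in charts if score_complexity(c) >= threshold]


def generate_qa_tiers(chart: Chart) -> list[QAPair]:
    """Multitier QA generation: one fundamental (visual) and one advanced
    (knowledge-intensive) question per chart."""
    return [
        QAPair(chart.chart_id, "fundamental",
               "Which series reaches the highest peak?", "series A"),
        QAPair(chart.chart_id, "advanced",
               "What physical process explains that peak?", "<domain answer>"),
    ]


def expert_validate(pairs: list[QAPair]) -> list[QAPair]:
    """Stand-in for human expert review: approve everything here."""
    for p in pairs:
        p.approved = True
    return [p for p in pairs if p.approved]


if __name__ == "__main__":
    charts = [Chart("spectrum_01", n_series=3, n_annotations=4),
              Chart("bar_02", n_series=1, n_annotations=1)]
    selected = select_charts(charts, threshold=3.0)
    benchmark = expert_validate([qa for c in selected for qa in generate_qa_tiers(c)])
    print(f"{len(benchmark)} QA pairs over {len(selected)} chart(s)")
```

In practice, the generation step would prompt an MLLM with the chart and domain context, and the validation step would route each pair to domain experts rather than approving automatically.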
Related papers
- Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking [56.27361644734853]
Knowledge Graph Question Answering systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. Despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues. We introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
arXiv Detail & Related papers (2025-05-29T14:44:52Z) - ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering [14.468507852394923]
Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. We introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. We propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements.
arXiv Detail & Related papers (2025-05-29T08:46:03Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. We train a series of models and evaluate them on a wide range of datasets spanning domains such as physics, chemistry, finance, and electronics.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks [0.0]
We propose MEQA, a framework for the meta-evaluation of question and answer (QA) benchmarks. We demonstrate this approach on cybersecurity benchmarks, using both human and LLM evaluators. Our choice of test domain is motivated by AI models' dual nature as powerful defensive tools and potential security threats.
arXiv Detail & Related papers (2025-04-18T19:01:53Z) - ExpertGenQA: Open-ended QA generation in Specialized Domains [9.412082058055823]
ExpertGenQA is a protocol that combines few-shot learning with structured topic and style categorization to generate comprehensive domain-specific QA pairs.
We show that ExpertGenQA achieves twice the efficiency of baseline few-shot approaches while maintaining 94.4% topic coverage.
arXiv Detail & Related papers (2025-03-04T19:09:48Z) - TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logic questions. We leverage four datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category. We assess VideoQA models' temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z) - Discerning and Characterising Types of Competency Questions for Ontologies [0.4757470449749875]
Competency Questions (CQs) are widely used in ontology development by guiding, among others, the scoping and validation stages.
Very limited guidance exists for formulating CQs and assessing whether they are good CQs, leading to issues such as ambiguity and unusable formulations.
This paper contributes theoretical foundations for CQs by analysing questions, their uses, and the myriad of ontology development tasks.
arXiv Detail & Related papers (2024-12-18T10:26:29Z) - OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce OmniEval, an omnidirectional and automatic RAG benchmark in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z) - A Comprehensive Survey of Action Quality Assessment: Method and Benchmark [25.694556140797832]
Action Quality Assessment (AQA) quantitatively evaluates the quality of human actions, providing automated assessments that reduce biases in human judgment. Recent advances in AQA have introduced innovative methodologies, but similar methods often intertwine across different domains. The lack of a unified benchmark and limited computational comparisons hinder consistent evaluation and fair assessment of AQA approaches.
arXiv Detail & Related papers (2024-12-15T10:47:26Z) - Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment [69.07445098168344]
We introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. Grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate more fine-grained IQA applications.
arXiv Detail & Related papers (2024-11-26T09:03:16Z) - KaPQA: Knowledge-Augmented Product Question-Answering [59.096607961704656]
We introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products.
We also propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task.
arXiv Detail & Related papers (2024-07-22T22:14:56Z) - Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs [33.87001216244801]
Attributed Question Answering (AQA) has attracted wide attention, but there are still several limitations in evaluating the attributions. We introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark containing comprehensive attribution categories. We have conducted extensive experiments to verify the effectiveness of CAQA.
arXiv Detail & Related papers (2024-01-26T04:11:07Z) - MarkQA: A large scale KBQA dataset with numerical reasoning [11.072552105311484]
We propose a new task, NR-KBQA, which requires the ability to perform both multi-hop reasoning and numerical reasoning.
We design a logic form in Python format called PyQL to represent the reasoning process of numerical reasoning questions (a generic illustration follows the list below).
We present a large dataset called MarkQA, which is automatically constructed from a small set of seeds.
arXiv Detail & Related papers (2023-10-24T04:50:59Z) - Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering [36.31193273252256]
Large Language Models (LLMs) have gained popularity and achieved remarkable results in open-domain tasks.
However, their performance in real industrial domain-specific scenarios remains only average due to a lack of domain-specific knowledge.
We provide a benchmark Question Answering (QA) dataset named MSQA, centered around Microsoft products and IT technical problems encountered by customers.
arXiv Detail & Related papers (2023-05-19T09:23:25Z) - ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos [120.80589215132322]
We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
arXiv Detail & Related papers (2023-05-04T03:04:59Z) - RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging for all of them.
RoMQA thus provides a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z) - QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z) - NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset [26.782937852417454]
We introduce NOAHQA, a bilingual QA dataset whose questions require numerical reasoning over compound mathematical expressions.
We evaluate state-of-the-art QA models trained on existing QA datasets and show that the best of them achieves only a 55.5 exact match score on NOAHQA.
We also present a new QA model that generates a reasoning graph, whose reasoning-graph metric still shows a large gap compared with human performance.
arXiv Detail & Related papers (2021-09-22T09:17:09Z) - Relation-Guided Pre-Training for Open-Domain Question Answering [67.86958978322188]
We propose a Relation-Guided Pre-Training (RGPT-QA) framework to solve complex open-domain questions.
We show that RGPT-QA achieves 2.2%, 2.4%, and 6.3% absolute improvements in Exact Match accuracy on Natural Questions, TriviaQA, and WebQuestions, respectively.
arXiv Detail & Related papers (2021-09-21T17:59:31Z) - RnG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering [57.94658176442027]
We present RnG-KBQA, a Rank-and-Generate approach for KBQA.
We achieve new state-of-the-art results on GrailQA and WebQSP datasets.
arXiv Detail & Related papers (2021-09-17T17:58:28Z) - Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of the data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
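The MarkQA entry above introduces PyQL, a Python-format logic form for numerical multi-hop questions. Its actual syntax is defined in that paper; the snippet below is only a generic illustration of the idea, expressing a multi-hop retrieval plus arithmetic question as ordinary Python over a toy knowledge base, with invented helper names (get_relation, avg) rather than PyQL's real constructs.

```python
# Illustrative only: an invented, PyQL-like Python representation of the
# reasoning behind "What is the average height of the peaks in country X?"
# The knowledge base and helper names are hypothetical, not PyQL's actual API.

KB = {  # tiny stand-in knowledge base: (subject, relation) -> values
    ("country_X", "has_peak"): ["peak_A", "peak_B"],
    ("peak_A", "height_m"): [8091.0],
    ("peak_B", "height_m"): [8167.0],
}


def get_relation(entity: str, relation: str) -> list:
    """Hop along one relation in the toy knowledge base."""
    return KB.get((entity, relation), [])


def avg(values: list[float]) -> float:
    """Numerical aggregation step."""
    return sum(values) / len(values)


# Multi-hop retrieval plus numerical reasoning expressed as ordinary Python:
peaks = get_relation("country_X", "has_peak")                       # hop 1
heights = [h for p in peaks for h in get_relation(p, "height_m")]   # hop 2
answer = avg(heights)                                               # arithmetic
print(answer)  # 8129.0
```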
This list is automatically generated from the titles and abstracts of the papers on this site.