"See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models
- URL: http://arxiv.org/abs/2502.11718v4
- Date: Fri, 30 May 2025 09:27:22 GMT
- Title: "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models
- Authors: Jihao Gu, Yingyao Wang, Pi Bu, Chen Wang, Ziming Wang, Tengtao Song, Donglai Wei, Jiale Yuan, Yingxiu Zhao, Yancheng He, Shilong Li, Jiaheng Liu, Meng Cao, Jun Song, Yingshui Tan, Xiang Li, Wenbo Su, Zhicheng Zheng, Xiaoyong Zhu, Bo Zheng
- Abstract summary: We introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA. Key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and ease of evaluation through short answers.
- Score: 38.921977141721605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and ease of evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple the visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field. Our evaluation-friendly code and data have already been open-sourced.
Related papers
- Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners.
In this work, we introduce a new task paradigm: proactive information gathering.
We design a scalable framework that generates partially specified, real-world tasks, masking key information.
Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z)
- ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark [50.89916747049978]
Existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope.
We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data.
We propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs.
arXiv Detail & Related papers (2025-06-12T17:57:05Z)
- Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs).
Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.
Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images.
Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives.
We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
- A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning [27.350370419751385]
Remote Sensing Image Captioning (RSIC) is a cross-modal field bridging vision and language, aimed at automatically generating natural language descriptions of features and scenes in remote sensing imagery.
Two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models.
This paper introduces and analyzes BRSIC, a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions.
arXiv Detail & Related papers (2025-03-06T16:31:34Z)
- InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models [30.986157664865534]
We introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark for evaluating the understanding of implicit meanings in images.
This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension.
Using this benchmark, we evaluate 15 open-source large vision language models (LVLMs) and GPT-4o, revealing that even the best-performing model lags behind human performance by nearly 14% in understanding implicit meaning.
arXiv Detail & Related papers (2025-02-19T13:42:37Z)
- VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues [32.00120712945976]
VLM$^2$-Bench is a benchmark designed to assess whether vision-language models can Visually Link Matching cues.
We identify critical challenges in models' ability to link visual cues, highlighting a significant performance gap where even GPT-4o lags 34.80% behind humans.
arXiv Detail & Related papers (2025-02-17T17:57:50Z)
- Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models [24.47838086336772]
Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality ability of language models to answer short questions.
We focus on the Chinese language over 6 major topics with 99 diverse subtopics.
arXiv Detail & Related papers (2024-11-11T17:10:56Z)
- CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z)
- VisEval: A Benchmark for Data Visualization in the Era of Large Language Models [12.077276008688065]
Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language.
In this paper, we propose a new NL2VIS benchmark called VisEval.
This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths.
arXiv Detail & Related papers (2024-07-01T05:35:30Z)
- FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs.
We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses.
Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
arXiv Detail & Related papers (2024-04-29T01:49:07Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- KoLA: Carefully Benchmarking World Knowledge of Large Language Models [87.96683299084788]
We construct a Knowledge-oriented LLM Assessment benchmark (KoLA).
We mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks.
We use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora to evaluate the capacity to handle unseen data and evolving knowledge.
arXiv Detail & Related papers (2023-06-15T17:20:46Z)
- Retrieval-based Knowledge Augmented Vision Language Pre-training [9.779887832992435]
A key challenge of knowledge-augmented pre-training is the lack of clear connections between knowledge and multi-modal data.
In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework.
For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data.
arXiv Detail & Related papers (2023-04-27T02:23:47Z)
- Intrinsic Knowledge Evaluation on Chinese Language Models [5.293979881130493]
This paper proposes four tasks on syntactic, semantic, commonsense, and factual knowledge, aggregating to a total of 39,308 questions.
Our probes and knowledge data prove to be a reliable benchmark for evaluating pre-trained Chinese LMs.
arXiv Detail & Related papers (2020-11-29T04:34:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.