Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
- URL: http://arxiv.org/abs/2504.11101v2
- Date: Wed, 16 Apr 2025 03:22:14 GMT
- Title: Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
- Authors: Yulong Zhang, Tianyi Liang, Xinyue Huang, Erfei Cui, Xu Guo, Pei Chu, Chenhui Li, Ru Zhang, Wenhai Wang, Gongshen Liu
- Abstract summary: We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs and combines model strengths.
- Score: 30.240680920617447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Optical Character Recognition (OCR) task is important for evaluating Vision-Language Models (VLMs) and for providing high-quality data sources for LLM training. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs, and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost, achieving state-of-the-art results across multiple metrics. For instance, our solution achieves 15.2% higher F1 scores than VLM-as-judge methods in quality verification, delivers 6.0% accuracy gains on mathematical calculation tasks, and requires rephrasing only 7.3% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision while remaining plug-and-play throughout.
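The abstract's key insight is that agreement among multiple VLM transcriptions signals correctness while divergence signals errors. The sketch below is a minimal illustration of that intuition, not the paper's exact Consensus Entropy formulation: it scores a sample by the average pairwise string agreement of the OCR outputs from several models, flags low-agreement samples, and picks the output that agrees most with the rest. The similarity measure, threshold, and function names are illustrative assumptions.

```python
# Minimal sketch of a consensus-style agreement score across multiple VLM OCR
# outputs. Illustrates the convergence/divergence intuition from the abstract;
# it is not the paper's actual Consensus Entropy computation.
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def pairwise_agreement(a: str, b: str) -> float:
    """Similarity in [0, 1] between two OCR transcriptions."""
    return SequenceMatcher(None, a, b).ratio()


def consensus_score(outputs: List[str]) -> float:
    """Average pairwise agreement across all model outputs.

    High values -> outputs converge (likely correct).
    Low values  -> outputs diverge (likely problematic sample).
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output has nothing to disagree with
    return sum(pairwise_agreement(a, b) for a, b in pairs) / len(pairs)


def select_best(outputs: List[str]) -> str:
    """Pick the output that agrees most with the others (a simple medoid choice)."""
    def support(i: int) -> float:
        return sum(pairwise_agreement(outputs[i], outputs[j])
                   for j in range(len(outputs)) if j != i)
    return outputs[max(range(len(outputs)), key=support)]


# Illustrative usage: three hypothetical VLM transcriptions of the same page.
outputs = [
    "Total revenue in 2023 was $4.2M.",
    "Total revenue in 2023 was $4.2M.",
    "Total revenve in 2O23 was 54.2M.",
]
score = consensus_score(outputs)
AGREEMENT_THRESHOLD = 0.8  # assumed cut-off, not taken from the paper
if score < AGREEMENT_THRESHOLD:
    print(f"Low consensus ({score:.2f}): flag sample for rephrasing/re-OCR.")
else:
    print(f"High consensus ({score:.2f}): keep '{select_best(outputs)}'")
```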
Related papers
- Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs. We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z) - Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision [49.46606936180063]
Video quality assessment (VQA) is essential for quantifying quality in various video processing systems. We introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. By training on a dataset $10\times$ larger than the existing VQA benchmarks, our model achieves zero-shot performance.
arXiv Detail & Related papers (2025-05-06T15:29:32Z) - Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models [9.238739743596236]
We propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs with the top 30% of data ranked by our score model achieves superior interpretation accuracy.
arXiv Detail & Related papers (2025-03-02T05:44:56Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and 'AND'/'OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
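The entry above observes that, when aggregating per-prompt scores from a black-box VLM, taking the maximum can be less reliable than taking the second-highest score. The toy sketch below only illustrates that aggregation choice; the prompt variants, score values, and function names are assumptions, not SPARC's actual pipeline.

```python
# Toy illustration of aggregating per-prompt scores for one candidate label.
# The scores would come from a black-box VLM queried with several prompt
# variants; here they are made-up numbers.
from typing import List


def aggregate_max(scores: List[float]) -> float:
    return max(scores)


def aggregate_second_highest(scores: List[float]) -> float:
    # Falls back to the max when only one prompt score is available.
    ranked = sorted(scores, reverse=True)
    return ranked[1] if len(ranked) > 1 else ranked[0]


prompt_scores = [0.91, 0.62, 0.58, 0.55]  # hypothetical scores for one label
print("max:", aggregate_max(prompt_scores))                        # can be an outlier
print("second-highest:", aggregate_second_highest(prompt_scores))  # more robust, per the entry above
```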
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach [31.654345704242512]
This paper introduces a novel, model-level, judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We achieve superior precision and recall with significantly lower computational demands.
arXiv Detail & Related papers (2024-11-26T00:44:37Z) - Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z) - Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks [12.313257689227013]
This paper introduces Selective Feature Imputation for Debiasing (SFID), a novel methodology that integrates feature pruning and low confidence imputation.
SFID is versatile, maintaining the semantic integrity of outputs, and cost-effective by eliminating the need for retraining.
Our experimental results demonstrate SFID's effectiveness across various VLM tasks, including zero-shot classification, text-to-image retrieval, image captioning, and text-to-image generation.
arXiv Detail & Related papers (2024-10-10T03:57:48Z) - MM-R$^3$: On (In-)Consistency of Vision-Language Models (VLMs) [26.475993408532304]
We analyze performance of SoTA Vision Language Models on three tasks: Question Rephrasing, Image Restyling, and Context Reasoning. Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa. We propose a simple yet effective mitigation strategy in the form of an adapter module trained to minimize inconsistency across prompts.
arXiv Detail & Related papers (2024-10-07T06:36:55Z) - Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language [41.40908753726324]
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks. We present Auto Cherry-Picker (ACP), a novel framework that generates high-quality cross-modality training samples at scale.
arXiv Detail & Related papers (2024-06-28T17:53:18Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
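The summary above describes UAL as adaptively setting the label-smoothing value per sample according to that sample's uncertainty. The sketch below only illustrates that general mechanism; how uncertainty is estimated and how it maps to a smoothing value are assumptions made for illustration, not details taken from the paper.

```python
# Sketch of uncertainty-dependent label smoothing for a classification-style
# loss. The uncertainty-to-smoothing mapping is an assumed linear schedule;
# UAL's actual estimator and schedule may differ.
import torch
import torch.nn.functional as F


def smoothing_from_uncertainty(uncertainty: torch.Tensor,
                               eps_min: float = 0.0,
                               eps_max: float = 0.2) -> torch.Tensor:
    """Map per-sample uncertainty in [0, 1] to a smoothing value in [eps_min, eps_max]."""
    return eps_min + (eps_max - eps_min) * uncertainty.clamp(0.0, 1.0)


def ual_style_loss(logits: torch.Tensor,
                   targets: torch.Tensor,
                   uncertainty: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with a per-sample smoothing value.

    logits:      (batch, num_classes)
    targets:     (batch,) integer class ids
    uncertainty: (batch,) values in [0, 1], higher = less certain sample
    """
    num_classes = logits.size(-1)
    eps = smoothing_from_uncertainty(uncertainty).unsqueeze(-1)   # (batch, 1)
    one_hot = F.one_hot(targets, num_classes).float()             # (batch, C)
    soft_targets = one_hot * (1.0 - eps) + eps / num_classes      # smooth more when uncertain
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()


# Illustrative usage with random tensors.
logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
uncertainty = torch.tensor([0.1, 0.9, 0.5, 0.0])  # hypothetical per-sample uncertainty
print(ual_style_loss(logits, targets, uncertainty).item())
```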
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Misconfidence-based Demonstration Selection for LLM In-Context Learning [0.0]
In-context learning with large language models (LLMs) excels at adapting to various tasks rapidly.
Current approaches to demonstration selection either rely on hard-to-acquire external supervision or require frequent interactions with LLMs.
We propose a new method called In-Context Reflection (ICR) to overcome these challenges.
arXiv Detail & Related papers (2024-01-12T00:11:24Z) - From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
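The IFD metric in the entry above compares how a model scores a response with and without its instruction; it is commonly summarized as the ratio of the conditioned answer loss to the unconditioned answer loss. Treat that exact form as an assumption here rather than a quotation of the paper; the sketch below is illustrative only.

```python
# Minimal sketch of an IFD-style score: ratio of the answer loss conditioned on
# the instruction to the answer loss without it. The exact formulation should be
# checked against the original paper; the log-probabilities here are made up.
from typing import List


def mean_answer_loss(token_logprobs: List[float]) -> float:
    """Average negative log-likelihood over the answer tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(answer_logprobs_given_instruction: List[float],
              answer_logprobs_alone: List[float]) -> float:
    """Higher values suggest the instruction helps the model less,
    i.e. the sample is harder to follow and may be more informative for tuning."""
    conditioned = mean_answer_loss(answer_logprobs_given_instruction)
    unconditioned = mean_answer_loss(answer_logprobs_alone)
    return conditioned / unconditioned


# Hypothetical per-token log-probabilities of the same answer, scored by an LLM
# with and without the instruction prefix.
with_instruction = [-0.4, -0.3, -0.5, -0.2]
without_instruction = [-1.2, -0.9, -1.4, -1.0]
print(round(ifd_score(with_instruction, without_instruction), 3))
```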