PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration
- URL: http://arxiv.org/abs/2602.19188v1
- Date: Sun, 22 Feb 2026 13:36:48 GMT
- Title: PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration
- Authors: Chen Duan, Zhentao Guo, Pei Fu, Zining Wang, Kai Zhou, Pengfei Yan,
- Abstract summary: We introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning.<n>This framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting.
- Score: 17.887453138676964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally-accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, consistently surpassing traditional MLLMs.
Related papers
- Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval? [8.45007357012084]
We investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers.<n>Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics.<n>We find that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance.
arXiv Detail & Related papers (2025-12-22T07:36:20Z) - ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints [42.713620384054146]
This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability.<n>We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation.
arXiv Detail & Related papers (2025-11-23T13:42:22Z) - Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes [0.0]
We compare fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images.<n>A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data to achieve up to 36% accuracy improvement.<n>Our work contributes to the broader effort of bridging vision and language, offering novel insights into efficient cross-modal learning strategies.
arXiv Detail & Related papers (2025-10-03T18:53:18Z) - MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens.<n>We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z) - Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions [1.3638337521666275]
Language models (LMs) are machine learning models designed to predict linguistic patterns by estimating the probability of word sequences based on large-scale datasets, such as text.<n>Although larger datasets typically enhance LM performance, scalability remains a challenge due to constraints in computational power and resources.<n>Recent research has focused on developing decentralized techniques to enable distributed training and inference.
arXiv Detail & Related papers (2025-03-20T15:18:25Z) - New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding.<n>It serves as an essential testing ground for Multimodal Large Language Models (MLLMs)
arXiv Detail & Related papers (2025-02-27T13:58:44Z) - AgentPS: Agentic Process Supervision for Content Moderation with Multimodal LLMs [9.35901507816989]
We introduce AgentPS, a framework that integrates Agentic Process Supervision into large language models.<n>We show that AgentPS achieves substantial improvements over baseline MLLMs on public benchmarks and proprietary datasets.<n>These results establish AgentPS as a scalable and effective solution for complex multimodal classification in large-scale industrial applications.
arXiv Detail & Related papers (2024-12-15T04:58:00Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.<n>It aims to localize instances of interest across multiple images based on open-ended text prompts.<n>We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT [9.682499180341273]
Large language models (LLMs) have significantly advanced text generation, but the human-like quality of their outputs presents major challenges.<n>We propose CUDRT, a comprehensive evaluation framework and bilingual benchmark in Chinese and English.<n>This framework supports scalable, reproducible experiments and enables analysis of how operational diversity, multilingual training sets, and LLM architectures influence detection performance.
arXiv Detail & Related papers (2024-06-13T12:43:40Z) - Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the textttLlama2-7b-chat model on nine different languages from the MUST-C dataset.
The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.