Related papers: Kosmos-2.5: A Multimodal Literate Model

URL: http://arxiv.org/abs/2309.11419v1
Date: Wed, 20 Sep 2023 15:50:08 GMT
Title: Kosmos-2.5: A Multimodal Literate Model
Authors: Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, Yaoyao Chang, Shaohan Huang, Wenhui Wang, Li Dong, Weiyao Luo, Shaoxiang Wu, Guoxin Wang, Cha Zhang, Furu Wei
Abstract summary: Kosmos-2.5 is a multimodal literate model for machine reading of text-intensive images. It excels in two distinct yet cooperative transcription tasks. It can be adapted for any text-intensive image understanding task with different prompts.
Score: 143.4565835051535
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.

Related papers

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs [52.79503055897109]
We present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation. We propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions.
arXiv Detail & Related papers (2025-04-11T08:46:49Z)
HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis [21.25786478579275]
Handwritten document recognition is one of the most challenging tasks in computer vision. Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis. This paper introduces HAND, a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks.
arXiv Detail & Related papers (2024-12-25T20:36:29Z)
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output [138.18086961321146]
InternLM-XComposer-2.5 (IXC-2.5) is a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks.
arXiv Detail & Related papers (2024-07-03T17:59:21Z)
LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation [30.897935761304034]
We propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features. DensePrompts, which contains 7,000 dense prompts, provides a comprehensive evaluation for the text-to-image generation task.
arXiv Detail & Related papers (2024-06-30T15:50:32Z)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions. By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z)
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model [108.85584502396182]
We propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we fine-tune only 1.2% of the parameters. Our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks.
arXiv Detail & Related papers (2023-10-08T11:33:09Z)
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts. Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
DUBLIN -- Document Understanding By Language-Image Network [37.42637168606938]
We propose DUBLIN, which is pretrained on web pages using three novel objectives. We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also achieve competitive performance on RVL-CDIP document classification.
arXiv Detail & Related papers (2023-05-23T16:34:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.