CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books
- URL: http://arxiv.org/abs/2507.10053v1
- Date: Mon, 14 Jul 2025 08:35:37 GMT
- Title: CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books
- Authors: Marc Serra Ortega, Emanuele Vivoli, Artemis Llabrés, Dimosthenis Karatzas
- Abstract summary: CoSMo is a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo consistently outperforms traditional baselines and significantly larger general-purpose vision-language models.
- Score: 7.887803138420098
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper introduces CoSMo, a novel multimodal Transformer for Page Stream Segmentation (PSS) in comic books, a critical task for automated content understanding, as it is a necessary first stage for many downstream tasks like character analysis, story indexing, or metadata enrichment. We formalize PSS for this unique medium and curate a new 20,800-page annotated dataset. CoSMo, developed in vision-only and multimodal variants, consistently outperforms traditional baselines and significantly larger general-purpose vision-language models across F1-Macro, Panoptic Quality, and stream-level metrics. Our findings highlight the dominance of visual features for comic PSS macro-structure, yet demonstrate multimodal benefits in resolving challenging ambiguities. CoSMo establishes a new state-of-the-art, paving the way for scalable comic book analysis.
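The abstract frames PSS as labeling every page in a comic stream and evaluating with F1-Macro, Panoptic Quality, and stream-level metrics. As a rough illustration of that framing only (not CoSMo's actual architecture; the feature dimension, label set, and hyperparameters below are assumptions), per-page embeddings can be contextualized with a small Transformer encoder and classified page by page:

```python
# Illustrative sketch of page stream segmentation as per-page classification.
# Not CoSMo's implementation: dimensions, label set, and depth are assumptions.
import torch
import torch.nn as nn

class PageStreamSegmenter(nn.Module):
    def __init__(self, feat_dim=768, num_classes=2, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, page_feats):
        # page_feats: (batch, num_pages, feat_dim) precomputed visual
        # (and optionally textual) embeddings, one vector per page.
        ctx = self.encoder(page_feats)   # contextualize each page within its stream
        return self.classifier(ctx)      # per-page logits, e.g. "new segment" vs "continuation"

# Usage: a stream of 32 pages with hypothetical 768-d page embeddings.
model = PageStreamSegmenter()
logits = model(torch.randn(1, 32, 768))
print(logits.shape)  # torch.Size([1, 32, 2])
```

Operating on page-level embeddings keeps the sequence length equal to the number of pages rather than the number of tokens, which is what makes stream-level Transformer context tractable for book-length inputs.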
Related papers
- Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval [44.008094698200026]
Cross-modal retrieval is gaining increasing efficacy and interest from the research community. In this paper, we design an approach that allows for multimodal queries composed of both an image and a text. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones.
arXiv Detail & Related papers (2025-03-03T19:01:17Z)
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation [5.528860524494717]
This paper presents an innovative approach called BGTAI to simplify multimodal understanding by utilizing gloss-based annotation.
By representing text and audio as gloss notations that omit complex semantic nuances, a better alignment with images can potentially be achieved.
arXiv Detail & Related papers (2024-10-04T04:59:50Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between image features and the section and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- Multimodal Transformer for Comics Text-Cloze [8.616858272810084]
Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels.
Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations.
We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for Text-cloze, achieving a 10% improvement over existing state-of-the-art models in both its easy and hard variants.
arXiv Detail & Related papers (2024-03-06T14:11:45Z)
- Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation (a generic optimal-transport alignment sketch appears after this list).
We evaluated our method on three recent multimodal datasets and demonstrated the effectiveness of our method in producing high-quality multimodal summaries.
arXiv Detail & Related papers (2022-10-10T14:27:10Z)
- Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization [23.475411831792716]
We propose ViL-Sum to jointly model paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization.
The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection.
Experimental results show that our proposed ViL-Sum significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2022-08-24T05:18:23Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework by interacting visual and language domains.
Our method contains video and textual segmentation and summarization modules. It formulates a cross-domain alignment objective with optimal transport distance to generate representative visual and textual summaries.
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)
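The last entry above names Multi-Modality Multi-Head Attention (M3H-Att) for capturing cross-media interactions. The sketch below shows a generic cross-modal multi-head attention layer in which textual tokens attend to image regions; it illustrates the general mechanism only, not the paper's M3H-Att design, and all dimensions and names are assumptions.

```python
# Generic cross-modal attention sketch (textual queries over visual keys/values).
# Not the M3H-Att architecture: layer layout and sizes are assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (batch, n_tokens, dim); image_regions: (batch, n_regions, dim)
        fused, _ = self.attn(query=text_tokens, key=image_regions, value=image_regions)
        return self.norm(text_tokens + fused)  # residual fusion of the two modalities

layer = CrossModalAttention()
out = layer(torch.randn(2, 16, 512), torch.randn(2, 36, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```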
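Two entries above (Semantics-Consistent Cross-domain Summarization and MHMS) rely on optimal-transport alignment between visual and textual segments. Below is a minimal Sinkhorn-style sketch of entropic optimal transport between two sets of segment embeddings; it is a generic illustration under assumed names and constants, not either paper's implementation.

```python
# Generic entropic optimal-transport alignment via Sinkhorn-Knopp iterations.
# All names, marginals, and constants are illustrative assumptions.
import torch

def sinkhorn_alignment(visual, textual, eps=0.1, n_iters=100):
    # visual: (m, d) visual-segment embeddings; textual: (n, d) textual-segment embeddings.
    cost = torch.cdist(visual, textual)      # pairwise transport costs
    cost = cost / cost.max()                 # normalize for numerical stability
    K = torch.exp(-cost / eps)               # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform source marginal
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))  # uniform target marginal
    u = torch.ones_like(a)
    for _ in range(n_iters):                 # alternating marginal scaling
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v) # soft cross-modal alignment matrix
    return plan, (plan * cost).sum()         # transport plan and OT distance

plan, dist = sinkhorn_alignment(torch.randn(6, 128), torch.randn(9, 128))
print(plan.shape)  # torch.Size([6, 9]); rows sum to ~1/6, columns to ~1/9
```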
This list is automatically generated from the titles and abstracts of the papers on this site.