CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding
- URL: http://arxiv.org/abs/2407.03550v1
- Date: Thu, 4 Jul 2024 00:07:50 GMT
- Title: CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding
- Authors: Emanuele Vivoli, Marco Bertini, Dimosthenis Karatzas
- Abstract summary: We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis.
Our benchmark comprises three existing datasets with expanded annotations to support multi-task evaluation.
To mitigate the over-representation of manga-style data, we have incorporated a new dataset of carefully selected American comic-style books.
- Score: 14.22900011952181
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The comic domain is rapidly advancing with the development of single-page analysis and synthesis models. However, evaluation metrics and datasets lag behind, often limited to small-scale or single-style test sets. We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis. Unlike existing benchmarks that focus on isolated tasks such as object detection or text recognition, CoMix addresses a broader range of tasks including object detection, speaker identification, character re-identification, reading order, and multi-modal reasoning tasks like character naming and dialogue generation. Our benchmark comprises three existing datasets with expanded annotations to support multi-task evaluation. To mitigate the over-representation of manga-style data, we have incorporated a new dataset of carefully selected American comic-style books, thereby enriching the diversity of comic styles. CoMix is designed to assess pre-trained models in zero-shot and limited fine-tuning settings, probing their transfer capabilities across different comic styles and tasks. The validation split of the benchmark is publicly available for research purposes, and an evaluation server for the held-out test split is also provided. Comparative results between human performance and state-of-the-art models reveal a significant performance gap, highlighting substantial opportunities for advancements in comic understanding. The dataset, baseline models, and code are accessible at the repository link. This initiative sets a new standard for comprehensive comic analysis, providing the community with a common benchmark for evaluation on a large and varied set.
Related papers
- Comics Datasets Framework: Mix of Comics datasets for detection benchmarking [11.457653763760792]
Comics as a medium uniquely combine text and images in styles often distinct from real-world visuals.
Computational research on comics has evolved from basic object detection to more sophisticated tasks.
We aim to standardize annotations across datasets, introduce a variety of comic styles into the datasets, and establish benchmark results with clear, replicable settings.
arXiv Detail & Related papers (2024-07-03T23:07:57Z)
- Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights [44.11628188443046]
A Graph Foundation Model (GFM) can work well across different graphs and tasks with a unified backbone.
Inspired by multi-modal models that align different modalities with natural language, the text has recently been adopted to provide a unified feature space for diverse graphs.
Despite the great potential of these text-space GFMs, current research in this field is hampered by two problems.
arXiv Detail & Related papers (2024-06-15T19:56:21Z)
- BlendX: Complex Multi-Intent Detection with Blended Patterns [4.852816974803059]
We present BlendX, a suite of refined datasets featuring more diverse patterns than their predecessors.
For dataset construction, we utilize both rule-based methods and a generative tool -- OpenAI's ChatGPT -- which is augmented with a similarity-driven strategy for utterance selection.
Experiments on BlendX reveal that state-of-the-art MID models struggle with the challenges posed by the new datasets.
arXiv Detail & Related papers (2024-03-27T06:13:04Z)
- Multi-Review Fusion-in-Context [20.681734117825822]
Grounded text generation requires both content selection and content consolidation.
Recent works have proposed a modular approach, with separate components for each step.
This study lays the groundwork for further exploration of modular text generation in the multi-document setting.
arXiv Detail & Related papers (2024-03-22T17:06:05Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization [54.59104881168188]
UniSumm is a unified few-shot summarization model pre-trained with multiple summarization tasks.
SummZoo is a new benchmark to better evaluate few-shot summarizers.
arXiv Detail & Related papers (2022-11-17T18:54:47Z)
- Evaluating and Improving Factuality in Multimodal Abstractive Summarization [91.46015013816083]
We propose CLIPBERTScore, which leverages the robustness and strong factuality detection performance of image-summary and document-summary metrics.
We show that this simple combination of two metrics achieves higher correlations in the zero-shot setting than existing factuality metrics for document summarization.
Our analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks.
arXiv Detail & Related papers (2022-11-04T16:50:40Z)
- Unsupervised Summarization with Customized Granularities [76.26899748972423]
We propose the first unsupervised multi-granularity summarization framework, GranuSum.
By inputting different numbers of events, GranuSum is capable of producing multi-granular summaries in an unsupervised manner.
arXiv Detail & Related papers (2022-01-29T05:56:35Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.