Deep multi-modal networks for book genre classification based on its cover
- URL: http://arxiv.org/abs/2011.07658v1
- Date: Sun, 15 Nov 2020 23:27:43 GMT
- Title: Deep multi-modal networks for book genre classification based on its cover
- Authors: Chandra Kundu, Lukun Zheng
- Abstract summary: We propose a multi-modal deep learning framework to solve the cover-based book classification problem.
Our method adds an extra modality by extracting text automatically from the book covers.
Results show that the multi-modal framework significantly outperforms the current state-of-the-art image-based models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Book covers are usually a reader's very first impression of a
book, and they often convey important information about its content. Book
genre classification based on the cover alone would be highly beneficial to
many modern retrieval systems, considering that the complete digitization of
books is an extremely expensive task. At the same time, it is an extremely
challenging task for the following reasons. First, there exists a wide
variety of book genres, many of which are not concretely defined. Second,
book covers, as graphic designs, vary in many different ways, such as
colors, styles, and textual information, even for books of the same genre.
Third, book cover designs may vary due to many external factors, such as
country, culture, and target reader population. With growing competitiveness
in the book industry, cover designers and typographers push cover designs to
their limits in the hope of attracting sales. Cover-based book
classification has therefore become a particularly exciting research topic
in recent years. In this paper, we propose a multi-modal deep learning
framework to solve this problem. The contribution of this paper is
four-fold. First, our method adds an extra modality by extracting text
automatically from the book covers. Second, state-of-the-art image-based and
text-based models are evaluated thoroughly for the task of book cover
classification. Third, we develop an efficient and scalable multi-modal
framework based only on the images and text shown on the covers. Fourth, a
thorough analysis of the experimental results is given and future work to
improve performance is suggested. The results show that the multi-modal
framework significantly outperforms the current state-of-the-art image-based
models. However, more effort and resources are needed for this
classification task to reach a satisfactory level of performance.
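
To make the described pipeline concrete, below is a minimal sketch of the
kind of late-fusion design the abstract outlines: OCR pulls text off the
cover, an image CNN and a text encoder produce per-modality features, and
the two are concatenated for classification. It assumes PyTorch,
torchvision, and pytesseract; the module choices, feature dimensions, and
concatenation-based fusion are illustrative assumptions, not the authors'
exact architecture.

```python
# Minimal sketch (assumes PyTorch + torchvision + pytesseract; not the
# authors' exact pipeline). OCR adds a text modality from the cover image,
# then image and text features are fused by concatenation for genre
# classification.
import torch
import torch.nn as nn
import torchvision.models as models
from PIL import Image
import pytesseract


def extract_cover_text(path: str) -> str:
    """OCR the cover image; pytesseract is one common engine choice."""
    return pytesseract.image_to_string(Image.open(path))


class CoverGenreClassifier(nn.Module):
    def __init__(self, vocab_size: int, num_genres: int, text_dim: int = 128):
        super().__init__()
        # Image branch: pretrained ResNet-50 with its head removed (2048-d).
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()
        self.image_encoder = backbone
        # Text branch: mean-pooled word embeddings over the OCR'd tokens.
        self.text_encoder = nn.EmbeddingBag(vocab_size, text_dim, mode="mean")
        # Late fusion: concatenate both feature vectors, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(2048 + text_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_genres),
        )

    def forward(self, images, token_ids, offsets):
        img_feat = self.image_encoder(images)             # (B, 2048)
        txt_feat = self.text_encoder(token_ids, offsets)  # (B, text_dim)
        return self.classifier(torch.cat([img_feat, txt_feat], dim=1))
```

In practice, the mean-pooled embedding branch could be swapped for a
stronger sequence encoder, and both branches fine-tuned jointly on a labeled
cover dataset.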
Related papers
- Panel Transitions for Genre Analysis in Visual Narratives [1.320904960556043]
We present a novel approach to multi-modal genre analysis based on comics and manga-style visual narratives.
We highlight some of the limitations and challenges of our existing computational approaches in modeling subjective labels.
arXiv Detail & Related papers (2023-12-14T08:05:09Z)
- Interleaving GANs with knowledge graphs to support design creativity for book covers [77.34726150561087]
We apply Generative Adversarial Networks (GANs) to the book covers domain.
We interleave GANs with knowledge graphs to alter the input title to obtain multiple possible options for any given title.
Finally, we use the discriminator obtained during the training phase to select the best images generated with new titles.
arXiv Detail & Related papers (2023-08-03T08:56:56Z)
- Enhancing Textbooks with Visuals from the Web for Improved Learning [50.01434477801967]
In this paper, we investigate the effectiveness of vision-language models to automatically enhance textbooks with images from the web.
We collect a dataset of e-textbooks in the math, science, social science and business domains.
We then set up a text-image matching task that involves retrieving and appropriately assigning web images to textbooks.
arXiv Detail & Related papers (2023-04-18T12:16:39Z)
- Book Cover Synthesis from the Summary [0.0]
We explore ways to produce a book cover using artificial intelligence, building on the fact that a book's summary and its cover are related.
We construct a dataset of English books that contains a large number of samples of summaries of existing books and their cover images.
We apply different text-to-image synthesis techniques to generate book covers from the summary and exhibit the results in this paper.
arXiv Detail & Related papers (2022-11-03T20:43:40Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- Using Full-Text Content to Characterize and Identify Best Seller Books [0.6442904501384817]
We consider the task of predicting whether a book will become a best seller from the standpoint of literary works.
Unlike previous approaches, we focus on the full content of books and consider both visualization and classification tasks.
Our results show that it is unfeasible to predict the success of books with high accuracy using only the full content of the texts.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2021-05-19T01:53:04Z)
- Font Style that Fits an Image -- Font Generation Based on Image Context [7.646713951724013]
We propose a method of generating a book title image based on its context within a book cover.
We propose an end-to-end neural network that inputs the book cover, a target location mask, and a desired book title and outputs stylized text suitable for the cover.
We demonstrate that the proposed method can effectively produce desirable and appropriate book cover text through quantitative and qualitative results.
arXiv Detail & Related papers (2021-01-26T03:06:50Z)
- Deep Learning for Scene Classification: A Survey [48.57123373347695]
Scene classification is a longstanding, fundamental and challenging problem in computer vision.
The rise of large-scale datasets and the renaissance of deep learning techniques have brought remarkable progress in the field of scene representation and classification.
This paper provides a comprehensive survey of recent achievements in scene classification using deep learning.
arXiv Detail & Related papers (2020-11-21T22:31:43Z)
- Deep learning for video game genre classification [2.66512000865131]
This paper proposes a multi-modal deep learning framework to solve this problem.
We compile a large dataset of 50,000 video games spanning 21 genres, consisting of cover images, description text, title text, and genre information.
Results show that the multi-modal framework outperforms the current state-of-the-art image-based or text-based models.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)
- Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.