MaRU: A Manga Retrieval and Understanding System Connecting Vision and
Language
- URL: http://arxiv.org/abs/2311.02083v1
- Date: Sun, 22 Oct 2023 05:51:02 GMT
- Title: MaRU: A Manga Retrieval and Understanding System Connecting Vision and
Language
- Authors: Conghao Tom Shen, Violet Yao, Yixin Liu
- Abstract summary: MaRU (Manga Retrieval and Understanding) is a multi-staged system that connects vision and language to facilitate efficient search of both dialogues and scenes within Manga frames.
The architecture of MaRU integrates an object detection model for identifying text and frame bounding boxes, a Vision Encoder-Decoder model for text recognition, a text encoder for embedding text, and a vision-text encoder that merges textual and visual information into a unified embedding space for scene retrieval.
- Score: 10.226184504988067
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manga, a widely celebrated Japanese comic art form, is renowned for its
diverse narratives and distinct artistic styles. However, the inherently visual
and intricate structure of Manga, which comprises images housing multiple
panels, poses significant challenges for content retrieval. To address this, we
present MaRU (Manga Retrieval and Understanding), a multi-staged system that
connects vision and language to facilitate efficient search of both dialogues
and scenes within Manga frames. The architecture of MaRU integrates an object
detection model for identifying text and frame bounding boxes, a Vision
Encoder-Decoder model for text recognition, a text encoder for embedding text,
and a vision-text encoder that merges textual and visual information into a
unified embedding space for scene retrieval. Rigorous evaluations reveal that
MaRU excels in end-to-end dialogue retrieval and exhibits promising results for
scene retrieval.
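The abstract describes a four-stage architecture (box detection, text recognition, text embedding, and joint vision-text embedding for scene retrieval). As a rough illustration of how such a pipeline could be wired together, the sketch below indexes a page for both dialogue and scene retrieval and ranks entries by cosine similarity. It is a minimal sketch, not the authors' implementation: every function, constant, and return value in it is a hypothetical stand-in.

```python
# Hypothetical sketch of a MaRU-style retrieval pipeline; NOT the authors' code.
# The four stages mirror the abstract: (1) detect text/frame boxes, (2) transcribe
# dialogue with a vision encoder-decoder, (3) embed dialogue text, (4) embed frames
# into the same space as text so a textual query can retrieve scenes.
# All model calls below are stand-ins that return dummy outputs.

from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

DIM = 256  # assumed embedding dimensionality


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    kind: str  # "text" or "frame"


def detect_boxes(page: np.ndarray) -> List[Box]:
    """Stage 1 (stand-in): an object detector would return text and frame boxes."""
    return [Box(10, 10, 120, 50, "text"), Box(0, 0, 400, 600, "frame")]


def recognise_text(page: np.ndarray, box: Box) -> str:
    """Stage 2 (stand-in): a vision encoder-decoder would transcribe the text box."""
    return "placeholder dialogue"


def embed_text(text: str) -> np.ndarray:
    """Stage 3 (stand-in): a text encoder would map dialogue or a query to a vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)


def embed_frame(page: np.ndarray, box: Box) -> np.ndarray:
    """Stage 4 (stand-in): a vision-text encoder would map the cropped frame into
    the same embedding space as embed_text, enabling text-to-scene retrieval."""
    rng = np.random.default_rng(int(box.x1 + box.y1))
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)


def index_page(page: np.ndarray) -> List[Tuple[np.ndarray, tuple]]:
    """Build (embedding, payload) pairs covering both dialogues and scenes."""
    entries = []
    for box in detect_boxes(page):
        if box.kind == "text":
            text = recognise_text(page, box)
            entries.append((embed_text(text), ("dialogue", text)))
        else:
            entries.append((embed_frame(page, box), ("scene", box)))
    return entries


def search(query: str, entries, top_k: int = 3):
    """Rank indexed entries by cosine similarity to the embedded query
    (embeddings are unit-normalised, so the dot product is the cosine)."""
    q = embed_text(query)
    return sorted(entries, key=lambda e: -float(q @ e[0]))[:top_k]


if __name__ == "__main__":
    page = np.zeros((600, 400, 3), dtype=np.uint8)  # dummy manga page
    for _, payload in search("who stole the sword?", index_page(page)):
        print(payload)
```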
Related papers
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models [60.67899965748755]
We present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder.
Our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.
arXiv Detail & Related papers (2024-07-27T05:53:37Z)
- A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
- The Manga Whisperer: Automatically Generating Transcriptions for Comics [55.544015596503726]
We present a unified model, Magi, that is able to detect panels, text boxes and character boxes.
We propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript.
arXiv Detail & Related papers (2024-01-18T18:59:09Z)
- Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z)
- Unleashing the Imagination of Text: A Novel Framework for Text-to-image Person Retrieval via Exploring the Power of Words [0.951828574518325]
We propose a novel framework to explore the power of words in sentences.
The framework employs the pre-trained full CLIP model as a dual encoder for the images and texts.
We introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of the visual story.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- Unconstrained Text Detection in Manga: a New Dataset and Baseline [3.04585143845864]
This work aims to binarize text in a comic genre with highly sophisticated text styles: Japanese manga.
To overcome the lack of a manga dataset with text annotations at a pixel level, we create our own.
Using these resources, we designed and evaluated a deep network model, outperforming current methods for text binarization in manga in most metrics.
arXiv Detail & Related papers (2020-09-09T00:16:51Z)