MAP-Elites with Transverse Assessment for Multimodal Problems in
Creative Domains
- URL: http://arxiv.org/abs/2403.07182v1
- Date: Mon, 11 Mar 2024 21:50:22 GMT
- Title: MAP-Elites with Transverse Assessment for Multimodal Problems in
Creative Domains
- Authors: Marvin Zammit, Antonios Liapis, Georgios N. Yannakakis
- Abstract summary: We propose a novel approach to handle multimodal creative tasks using Quality Diversity evolution.
Our contribution is a variation of the MAP-Elites algorithm, MAP-Elites with Transverse Assessment (MEliTA).
MEliTA decouples the artefacts' modalities and promotes cross-pollination between elites.
- Score: 2.7869568828212175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent advances in language-based generative models have paved the way
for the orchestration of multiple generators of different artefact types (text,
image, audio, etc.) into one system. Presently, many open-source pre-trained
models combine text with other modalities, thus enabling shared vector
embeddings to be compared across different generators. Within this context we
propose a novel approach to handle multimodal creative tasks using Quality
Diversity evolution. Our contribution is a variation of the MAP-Elites
algorithm, MAP-Elites with Transverse Assessment (MEliTA), which is tailored
for multimodal creative tasks and leverages deep learned models that assess
coherence across modalities. MEliTA decouples the artefacts' modalities and
promotes cross-pollination between elites. As a test bed for this algorithm, we
generate text descriptions and cover images for a hypothetical video game and
assign each artefact a unique modality-specific behavioural characteristic.
Results indicate that MEliTA can improve text-to-image mappings within the
solution space, compared to a baseline MAP-Elites algorithm that strictly
treats each image-text pair as one solution. Our approach represents a
significant step forward in multimodal bottom-up orchestration and lays the
groundwork for more complex systems coordinating multimodal creative agents in
the future.
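
To make the idea concrete, below is a minimal sketch of how a MEliTA-style loop differs from vanilla MAP-Elites: each modality keeps its own behavioural characteristic, and when one modality is varied the new artefact is re-paired with whichever elite artefact of the other modality it is most coherent with. The generators, mutation operator, behavioural characteristics, coherence score, and the size of the candidate pool below are hypothetical placeholders inferred from the abstract, not the paper's actual models or procedure.

```python
# Minimal, hypothetical sketch of MAP-Elites with Transverse Assessment (MEliTA),
# based only on the abstract above. All functions are placeholder stand-ins.
import random

random.seed(0)

GRID = 10          # bins per behavioural dimension
ITERATIONS = 2000  # evolutionary budget

def generate():
    """Placeholder generator: an artefact is just a random 8-dim feature vector."""
    return [random.random() for _ in range(8)]

def mutate(artefact):
    """Placeholder mutation: jitter one feature, clamped to [0, 1]."""
    out = list(artefact)
    i = random.randrange(len(out))
    out[i] = min(1.0, max(0.0, out[i] + random.gauss(0.0, 0.1)))
    return out

def behaviour(artefact):
    """Modality-specific behavioural characteristic, binned onto the grid."""
    return min(GRID - 1, int(artefact[0] * GRID))

def coherence(text, image):
    """Stand-in for a cross-modal (e.g. CLIP-like) coherence score, higher is better."""
    return 1.0 - sum(abs(a - b) for a, b in zip(text, image)) / len(text)

archive = {}  # (text BC, image BC) cell -> (fitness, text artefact, image artefact)

def try_insert(text, image):
    cell = (behaviour(text), behaviour(image))
    fit = coherence(text, image)
    if cell not in archive or fit > archive[cell][0]:
        archive[cell] = (fit, text, image)

# Seed the archive with random text-image pairs, as in vanilla MAP-Elites.
for _ in range(50):
    try_insert(generate(), generate())

for _ in range(ITERATIONS):
    _, text, image = random.choice(list(archive.values()))
    if random.random() < 0.5:
        # Vary only the text; transverse assessment then pairs the new text with
        # whichever image (its parent's or another elite's) it is most coherent with.
        new_text = mutate(text)
        pool = [image] + [e[2] for e in random.sample(list(archive.values()),
                                                      k=min(3, len(archive)))]
        try_insert(new_text, max(pool, key=lambda img: coherence(new_text, img)))
    else:
        # Symmetric case: vary only the image and re-pair it with the best text.
        new_image = mutate(image)
        pool = [text] + [e[1] for e in random.sample(list(archive.values()),
                                                     k=min(3, len(archive)))]
        try_insert(max(pool, key=lambda txt: coherence(txt, new_image)), new_image)

print(f"filled cells: {len(archive)}, "
      f"best coherence: {max(f for f, _, _ in archive.values()):.3f}")
```
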
Related papers
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships between the image features and the section and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction [16.475718456640784]
We propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction.
VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2024-04-18T08:56:47Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- MMoE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts [92.76662894585809]
We introduce an approach to enhance multimodal models, which we call Multimodal Mixtures of Experts (MMoE).
MMoE can be applied to various types of models to improve their performance.
arXiv Detail & Related papers (2023-11-16T05:31:21Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Improving Cross-modal Alignment for Text-Guided Image Inpainting [36.1319565907582]
Text-guided image inpainting (TGII) aims to restore the missing regions of a damaged image based on a given text.
We propose a novel model for TGII by improving cross-modal alignment.
Our model achieves state-of-the-art performance compared with other strong competitors.
arXiv Detail & Related papers (2023-01-26T19:18:27Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works suffer from semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference [3.505062507621494]
We propose a Multimodal Hierarchical Selective Transformer (mhsf) model that considers reciprocal relationships among modalities.
We evaluate the generality of the proposed mhsf model under both pre-training+fine-tuning and training-from-scratch strategies.
arXiv Detail & Related papers (2021-08-11T09:59:34Z)
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)