X-modaler: A Versatile and High-performance Codebase for Cross-modal
Analytics
- URL: http://arxiv.org/abs/2108.08217v1
- Date: Wed, 18 Aug 2021 16:05:30 GMT
- Title: X-modaler: A Versatile and High-performance Codebase for Cross-modal
Analytics
- Authors: Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei
- Abstract summary: X-modaler encapsulates the state-of-the-art cross-modal analytics into several general-purpose stages.
X-modaler is Apache-licensed, and its source code, sample projects and pre-trained models are available online.
- Score: 99.03895740754402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rise and development of deep learning over the past decade, there
has been a steady momentum of innovation and breakthroughs that convincingly
push the state of the art in cross-modal analytics between vision and language
in the multimedia field. Nevertheless, there has not been an open-source codebase
that supports training and deploying numerous neural network models for
cross-modal analytics in a unified and modular fashion. In this work, we
propose X-modaler, a versatile and high-performance codebase that
encapsulates state-of-the-art cross-modal analytics into several
general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction,
decoder, and decode strategy). Each stage covers a series of modules widely
adopted in state-of-the-art methods and allows seamless switching between them.
This design naturally enables flexible implementations of state-of-the-art
algorithms for image captioning, video captioning, and vision-language
pre-training, aiming to facilitate the rapid development of the research
community. Meanwhile, since the effective modular designs in several stages
(e.g., cross-modal interaction) are shared across different vision-language
tasks, X-modaler can easily be extended to power startup prototypes for other
tasks in cross-modal analytics, including visual question answering, visual
commonsense reasoning, and cross-modal retrieval. X-modaler is an
Apache-licensed codebase, and its source code, sample projects and pre-trained
models are available online:
https://github.com/YehLi/xmodaler.
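The staged design described in the abstract lends itself to a registry-driven pipeline. The following is a minimal sketch of that idea in Python; all class, function, and registry names are hypothetical placeholders, not X-modaler's actual API, so consult the GitHub repository above for the real interfaces.

```python
# Minimal sketch of a stage-based cross-modal pipeline, in the spirit of the
# abstract above. All names here are hypothetical placeholders, NOT X-modaler's
# real API; see https://github.com/YehLi/xmodaler for the actual interfaces.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PipelineConfig:
    preprocessor: str = "grid_features"      # pre-processing stage
    encoder: str = "transformer_encoder"     # encoding stage
    interaction: str = "cross_attention"     # cross-modal interaction stage
    decoder: str = "transformer_decoder"     # decoding stage
    decode_strategy: str = "beam_search"     # e.g. greedy vs. beam search


# Registries map a config string to a module factory, so each stage can be
# swapped without touching the rest of the pipeline.
REGISTRY: Dict[str, Dict[str, Callable[[], Callable]]] = {
    "preprocessor": {"grid_features": lambda: (lambda img: f"feat({img})")},
    "encoder": {"transformer_encoder": lambda: (lambda x: f"enc({x})")},
    "interaction": {"cross_attention": lambda: (lambda v, t: f"xattn({v},{t})")},
    "decoder": {"transformer_decoder": lambda: (lambda h: ["a", "dog", "runs"])},
    "decode_strategy": {"beam_search": lambda: (lambda toks: " ".join(toks))},
}


def build_pipeline(cfg: PipelineConfig) -> Callable[[str, str], str]:
    """Assemble a captioning pipeline from independently swappable stages."""
    pre = REGISTRY["preprocessor"][cfg.preprocessor]()
    enc = REGISTRY["encoder"][cfg.encoder]()
    inter = REGISTRY["interaction"][cfg.interaction]()
    dec = REGISTRY["decoder"][cfg.decoder]()
    strategy = REGISTRY["decode_strategy"][cfg.decode_strategy]()

    def run(image: str, text_prompt: str) -> str:
        visual = enc(pre(image))              # pre-processing + encoding
        fused = inter(visual, text_prompt)    # cross-modal interaction
        tokens: List[str] = dec(fused)        # decoding
        return strategy(tokens)               # decode strategy

    return run


if __name__ == "__main__":
    caption = build_pipeline(PipelineConfig())("demo.jpg", "<bos>")
    print(caption)  # -> "a dog runs"
```

Because each stage is looked up by name, swapping, for example, the cross-modal interaction module only requires changing a single config field, which is the kind of seamless switching between modules the abstract describes.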
Related papers
- X-VILA: Cross-Modality Alignment for Large Language Model [91.96081978952283]
X-VILA is an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities.
We propose a visual alignment mechanism with a visual embedding highway module to address the problem of visual information loss.
X-VILA exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins.
arXiv Detail & Related papers (2024-05-29T17:59:58Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to deal with modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance-based and sentiment-based contrastive learning, to aid prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z)
- Cross-modal Prototype Driven Network for Radiology Report Generation [30.029659845237077]
Radiology report generation (RRG) aims to automatically describe a radiology image with human-like language and could potentially support the work of radiologists.
Previous approaches often adopt an encoder-decoder architecture and focus on single-modal feature learning.
Here we propose a Cross-modal PROtotype driven NETwork (XPRONET) to promote cross-modal pattern learning and exploit it to improve the task of radiology report generation.
arXiv Detail & Related papers (2022-07-11T12:29:33Z)
- MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning (a generic sketch of this kind of contrastive objective appears after this list).
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
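Several entries above, notably MMCL and i-Code, rely on contrastive objectives that pull matched cross-modal pairs together and push mismatched pairs apart. The snippet below is a generic InfoNCE-style sketch of that idea in PyTorch, not the exact formulation used by either paper; the batch size, embedding dimension, and temperature are illustrative assumptions.

```python
# Generic InfoNCE-style cross-modal contrastive loss, a rough illustration of
# the kind of objective mentioned for MMCL and i-Code above. This is NOT the
# exact loss from either paper; shapes and the temperature are assumptions.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(vision_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Pull matched (vision, text) pairs together; push mismatched pairs apart."""
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs lie on the diagonal

    # Symmetric loss: classify the correct text for each image and vice versa.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return (loss_v2t + loss_t2v) / 2


if __name__ == "__main__":
    vision = torch.randn(8, 256)  # 8 image embeddings of dimension 256
    text = torch.randn(8, 256)    # 8 paired text embeddings
    print(cross_modal_contrastive_loss(vision, text).item())
```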
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.