Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine
Comprehension
- URL: http://arxiv.org/abs/2204.02566v1
- Date: Wed, 6 Apr 2022 03:41:13 GMT
- Title: Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine
Comprehension
- Authors: Huibin Zhang and Zhengkun Zhang and Yao Zhang and Jun Wang and Yufan
Li and Ning Jiang and Xin Wei and Zhenglu Yang
- Abstract summary: Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step.
In this study, we approach Procedural MultiModal Machine Comprehension (M3C) at a fine-grained level (compared with existing explorations at a document or sentence level), that is, the entity level.
- Score: 23.281727955934304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Procedural Multimodal Documents (PMDs) organize textual instructions and
corresponding images step by step. Comprehending PMDs and inducing their
representations for the downstream reasoning tasks is designated as Procedural
MultiModal Machine Comprehension (M3C). In this study, we approach Procedural
M3C at a fine-grained level (compared with existing explorations at a document
or sentence level), that is, the entity level. We model entities in both their
temporal and cross-modal relations and propose a novel Temporal-Modal Entity
Graph (TMEG). Specifically, a graph structure is formulated to capture textual
and visual entities and trace their temporal-modal evolution. In addition, a
graph aggregation module is introduced to conduct graph encoding and reasoning.
Comprehensive experiments across three Procedural M3C tasks are conducted on
the established RecipeQA dataset and our new CraftQA dataset, which better
evaluates the generalization of TMEG.
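The abstract describes a graph whose nodes are textual and visual entities, with edges tracing temporal evolution across steps and cross-modal links within a step. As a rough illustration of that idea (names, data layout, and edge rules here are hypothetical, not the paper's actual construction), a toy temporal-modal entity graph could be built like this:

```python
def build_tmeg(steps):
    """Build a toy temporal-modal entity graph.

    `steps` is a list of dicts, one per procedural step, each with
    'text_entities' and 'visual_entities' (sets of entity names).
    Nodes are (step, modality, entity) tuples; edges are typed as
    'temporal' (same entity re-mentioned in a later step) or
    'cross-modal' (entity appears in both modalities of one step).
    This is an illustrative sketch only.
    """
    nodes, edges = set(), []
    last_seen = {}  # entity name -> node of its most recent mention
    for t, step in enumerate(steps):
        current = []
        for modality in ("text", "visual"):
            for ent in step[f"{modality}_entities"]:
                node = (t, modality, ent)
                nodes.add(node)
                if ent in last_seen:  # link back to the earlier mention
                    edges.append((last_seen[ent], node, "temporal"))
                current.append((ent, node))
        # cross-modal edge when an entity shows up in text and image
        for ent in step["text_entities"] & step["visual_entities"]:
            edges.append(((t, "text", ent), (t, "visual", ent), "cross-modal"))
        for ent, node in current:  # update after the whole step
            last_seen[ent] = node
    return nodes, edges
```

For example, a two-step recipe where "egg" appears in both modalities of step 0 and recurs in step 1 yields one cross-modal edge and temporal edges tracing each entity forward.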
Related papers
- MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion [51.80447197290866]
We introduce MyGO to process, fuse, and augment the fine-grained modality information from MMKGs.
MyGO tokenizes multi-modal raw data as fine-grained discrete tokens and learns entity representations with a cross-modal entity encoder.
Experiments on standard MMKGC benchmarks reveal that our method surpasses 20 of the latest models.
arXiv Detail & Related papers (2024-04-15T05:40:41Z)
- mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning [8.1113308714581]
This paper introduces a novel multimodal chart question-answering model.
Our model integrates visual and linguistic processing, overcoming the constraints of existing methods.
This approach has demonstrated superior performance on multiple public datasets.
arXiv Detail & Related papers (2024-04-02T01:28:44Z)
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [48.63002688222462]
A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop MultiModal Chart Assistant (MMC-A), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
arXiv Detail & Related papers (2023-11-15T23:36:42Z)
- Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
- Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z)
- Graph-Text Multi-Modal Pre-training for Medical Representation Learning [7.403725826586844]
We present MedGTX, a pre-trained model for multi-modal representation learning of structured and textual EHR data.
We pre-train our model through four proxy tasks on MIMIC-III, an open-source EHR dataset.
The results consistently show the effectiveness of pre-training the model for joint representation of both structured and unstructured information from EHR.
arXiv Detail & Related papers (2022-03-18T14:45:42Z)
- Knowledge Perceived Multi-modal Pretraining in E-commerce [12.012793707741562]
Current multi-modal pretraining methods for image and text modalities lack robustness to missing and noisy modalities.
We propose K3M, which introduces a knowledge modality into multi-modal pretraining to correct noise and compensate for missing image and text content.
arXiv Detail & Related papers (2021-08-20T08:01:28Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [77.21951145754065]
We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as iterative message passing over a joint graph.
CSMGAN effectively captures high-order interactions between the two modalities, enabling more precise localization.
arXiv Detail & Related papers (2020-08-04T08:25:24Z)
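Several entries above (TMEG's graph aggregation module, the unified graph models for video, CSMGAN's joint-graph message passing) build on the same primitive: iteratively updating node features from their neighbors. A minimal, framework-free sketch of one mean-aggregation round (illustrative only; these papers use learned transformations and attention weights rather than a plain mean):

```python
from collections import defaultdict

def message_passing_step(features, edges):
    """One round of mean-aggregation message passing.

    features: dict mapping node -> feature vector (list of floats).
    edges: iterable of (u, v) pairs, treated as undirected.
    Each node's new feature is the mean of its own feature and its
    neighbors' features. Illustrative sketch only.
    """
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for node, feat in features.items():
        msgs = [features[n] for n in neighbors[node]] + [feat]
        dim = len(feat)
        updated[node] = [sum(m[d] for m in msgs) / len(msgs)
                         for d in range(dim)]
    return updated
```

Stacking several such rounds lets information flow along longer paths in the graph, which is what enables the "high-order interactions" these models aim to capture.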
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.