Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine
Comprehension
- URL: http://arxiv.org/abs/2204.02566v1
- Date: Wed, 6 Apr 2022 03:41:13 GMT
- Title: Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine
Comprehension
- Authors: Huibin Zhang and Zhengkun Zhang and Yao Zhang and Jun Wang and Yufan
Li and Ning Jiang and Xin Wei and Zhenglu Yang
- Abstract summary: Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step.
In this study, we approach Procedural MultiModal Machine Comprehension (M3C) at a fine-grained level (compared with existing explorations at a document or sentence level), that is, the entity level.
- Score: 23.281727955934304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Procedural Multimodal Documents (PMDs) organize textual instructions and
corresponding images step by step. Comprehending PMDs and inducing their
representations for the downstream reasoning tasks is designated as Procedural
MultiModal Machine Comprehension (M3C). In this study, we approach Procedural
M3C at a fine-grained level (compared with existing explorations at a document
or sentence level), that is, the entity level. We model entities in both their
temporal and cross-modal relations and propose a novel Temporal-Modal Entity
Graph (TMEG). Specifically, a graph structure is formulated to capture textual
and visual entities and trace their temporal-modal evolution. In addition, a
graph aggregation module is introduced to conduct graph encoding and reasoning.
Comprehensive experiments across three Procedural M3C tasks are conducted on
the established RecipeQA dataset and our new CraftQA dataset, which better
evaluates the generalization of TMEG.
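The abstract describes a graph whose nodes are textual and visual entities, with edges tracing temporal evolution across steps and cross-modal links within a step. As a rough illustration of that idea (names, data layout, and edge rules here are hypothetical, not the paper's actual construction), a toy temporal-modal entity graph could be built like this:

```python
def build_tmeg(steps):
    """Build a toy temporal-modal entity graph.

    `steps` is a list of dicts, one per procedural step, each with
    'text_entities' and 'visual_entities' (sets of entity names).
    Nodes are (step, modality, entity) tuples; edges are typed as
    'temporal' (same entity re-mentioned in a later step) or
    'cross-modal' (entity appears in both modalities of one step).
    This is an illustrative sketch only.
    """
    nodes, edges = set(), []
    last_seen = {}  # entity name -> node of its most recent mention
    for t, step in enumerate(steps):
        current = []
        for modality in ("text", "visual"):
            for ent in step[f"{modality}_entities"]:
                node = (t, modality, ent)
                nodes.add(node)
                if ent in last_seen:  # link back to the earlier mention
                    edges.append((last_seen[ent], node, "temporal"))
                current.append((ent, node))
        # cross-modal edge when an entity shows up in text and image
        for ent in step["text_entities"] & step["visual_entities"]:
            edges.append(((t, "text", ent), (t, "visual", ent), "cross-modal"))
        for ent, node in current:  # update after the whole step
            last_seen[ent] = node
    return nodes, edges
```

For example, a two-step recipe where "egg" appears in both modalities of step 0 and recurs in step 1 yields one cross-modal edge and temporal edges tracing each entity forward.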
Related papers
- MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion [51.80447197290866]
We introduce MyGO to process, fuse, and augment the fine-grained modality information from MMKGs.
MyGO tokenizes multi-modal raw data as fine-grained discrete tokens and learns entity representations with a cross-modal entity encoder.
Experiments on standard MMKGC benchmarks reveal that our method surpasses 20 of the latest models.
arXiv Detail & Related papers (2024-04-15T05:40:41Z)
- mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning [8.1113308714581]
This paper introduces a novel multimodal chart question-answering model.
Our model integrates visual and linguistic processing, overcoming the constraints of existing methods.
This approach has demonstrated superior performance on multiple public datasets.
arXiv Detail & Related papers (2024-04-02T01:28:44Z)
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [48.63002688222462]
A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop MultiModal Chart Assistant (MMC-A), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
arXiv Detail & Related papers (2023-11-15T23:36:42Z)
- Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
- Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z)
- Graph-Text Multi-Modal Pre-training for Medical Representation Learning [7.403725826586844]
We present MedGTX, a pre-trained model for multi-modal representation learning of structured and textual EHR data.
We pre-train our model through four proxy tasks on MIMIC-III, an open-source EHR dataset.
The results consistently show the effectiveness of pre-training the model for joint representation of both structured and unstructured information from EHR.
arXiv Detail & Related papers (2022-03-18T14:45:42Z)
- Knowledge Perceived Multi-modal Pretraining in E-commerce [12.012793707741562]
Current multi-modal pretraining methods for image and text modalities lack robustness to missing and noisy modalities.
We propose K3M, which introduces a knowledge modality into multi-modal pretraining to correct noise and compensate for missing image and text content.
arXiv Detail & Related papers (2021-08-20T08:01:28Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [77.21951145754065]
We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as iterative message passing over a joint graph.
CSMGAN effectively captures high-order interactions between the two modalities, enabling more precise localization.
arXiv Detail & Related papers (2020-08-04T08:25:24Z)
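Several entries above (TMEG's graph aggregation module, the unified graph models for video, CSMGAN's joint-graph message passing) build on the same primitive: iteratively updating node features from their neighbors. A minimal, framework-free sketch of one mean-aggregation round (illustrative only; these papers use learned transformations and attention weights rather than a plain mean):

```python
from collections import defaultdict

def message_passing_step(features, edges):
    """One round of mean-aggregation message passing.

    features: dict mapping node -> feature vector (list of floats).
    edges: iterable of (u, v) pairs, treated as undirected.
    Each node's new feature is the mean of its own feature and its
    neighbors' features. Illustrative sketch only.
    """
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for node, feat in features.items():
        msgs = [features[n] for n in neighbors[node]] + [feat]
        dim = len(feat)
        updated[node] = [sum(m[d] for m in msgs) / len(msgs)
                         for d in range(dim)]
    return updated
```

Stacking several such rounds lets information flow along longer paths in the graph, which is what enables the "high-order interactions" these models aim to capture.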
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.