Multimodal Relational Triple Extraction with Query-based Entity Object Transformer
- URL: http://arxiv.org/abs/2408.08709v1
- Date: Fri, 16 Aug 2024 12:43:38 GMT
- Title: Multimodal Relational Triple Extraction with Query-based Entity Object Transformer
- Authors: Lei Hei, Ning An, Tingjing Liao, Qi Ma, Jiaqi Wang, Feiliang Ren
- Abstract summary: Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs.
We propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs.
We also propose QEOT, a query-based model with a selective attention mechanism, to dynamically explore the interaction and fusion of textual and visual information.
- Score: 20.97497765985682
- Abstract: Multimodal Relation Extraction is crucial for constructing flexible and realistic knowledge graphs. Recent studies focus on extracting the relation type with entity pairs present in different modalities, such as one entity in the text and another in the image. However, existing approaches require entities and objects to be given beforehand, which is costly and impractical. To address this limitation, we propose a novel task, Multimodal Entity-Object Relational Triple Extraction, which aims to extract all triples (entity span, relation, object region) from image-text pairs. To facilitate this study, we modified a multimodal relation extraction dataset, MORE, which includes 21 relation types, to create a new dataset containing 20,264 triples, averaging 5.75 triples per image-text pair. Moreover, we propose QEOT, a query-based model with a selective attention mechanism, to dynamically explore the interaction and fusion of textual and visual information. In particular, the proposed method can simultaneously accomplish entity extraction, relation classification, and object detection with a set of queries. Our method is suitable for downstream applications and reduces the error accumulation inherent in pipeline-style approaches. Extensive experimental results demonstrate that our proposed method outperforms existing baselines by 8.06% and achieves state-of-the-art performance.
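The abstract describes a query-based design in which a fixed set of queries jointly attends over textual and visual features to produce one fused representation per candidate triple. The core fusion step of such a design can be sketched as plain scaled dot-product attention in NumPy; this is a minimal illustration of the general idea, not the paper's actual QEOT architecture, and all dimensions and names here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_based_fusion(queries, text_feats, image_feats):
    """Each query attends over the concatenated text and image tokens
    via scaled dot-product attention, yielding one fused vector per query."""
    tokens = np.concatenate([text_feats, image_feats], axis=0)  # (T+V, d)
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)                    # (Q, T+V)
    weights = softmax(scores, axis=-1)                          # rows sum to 1
    return weights @ tokens                                     # (Q, d)

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(5, d))     # 5 text-token features (hypothetical)
image = rng.normal(size=(7, d))    # 7 image-region features (hypothetical)
queries = rng.normal(size=(4, d))  # 4 learnable triple queries

fused = query_based_fusion(queries, text, image)
print(fused.shape)  # (4, 16)
```

In a full model of this kind, each fused query vector would then feed separate prediction heads for the entity span, the relation type, and the object region, trained jointly; the sketch covers only the cross-modal fusion step that makes those joint predictions possible.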
Related papers
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in few-shot settings.
arXiv Detail & Related papers (2023-06-19T15:31:34Z) - Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z) - Learnable Pillar-based Re-ranking for Image-Text Retrieval [119.9979224297237]
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
arXiv Detail & Related papers (2023-04-25T04:33:27Z) - ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z) - Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive predictions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z) - A Masked Image Reconstruction Network for Document-level Relation Extraction [3.276435438007766]
Document-level relation extraction requires inference over multiple sentences to extract complex relational triples.
We propose a novel Document-level Relation Extraction model based on a Masked Image Reconstruction network (DRE-MIR).
We evaluate our model on three public document-level relation extraction datasets.
arXiv Detail & Related papers (2022-04-21T02:41:21Z) - Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction [40.00702385889112]
We propose a novel multi-prototype embedding network model to jointly extract the composition of relational triples.
We design a hybrid learning mechanism that bridges text and knowledge concerning both entities and relations.
Experimental results demonstrate that the proposed method improves few-shot triple extraction performance.
arXiv Detail & Related papers (2020-10-30T04:18:39Z) - HittER: Hierarchical Transformers for Knowledge Graph Embeddings [85.93509934018499]
We propose HittER to learn representations of entities and relations in a complex knowledge graph.
Experimental results show that HittER achieves new state-of-the-art results on multiple link prediction datasets.
We additionally propose a simple approach to integrate HittER into BERT and demonstrate its effectiveness on two Freebase factoid question answering datasets.
arXiv Detail & Related papers (2020-08-28T18:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.