Related papers: Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction

Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction

URL: http://arxiv.org/abs/2509.21151v1
Date: Thu, 25 Sep 2025 13:38:38 GMT
Title: Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction
Authors: Lei Hei, Tingjing Liao, Yingxin Pei, Yiyang Qi, Jiaqi Wang, Ruiting Li, Feiliang Ren,
Abstract summary: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text.<n>underlineRetrieval underlineOver underlineClassification (ROC) is a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics.<n>ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning.
Score: 6.478238734128006
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose \underline{R}etrieval \underline{O}ver \underline{C}lassification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.

Related papers

CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning [67.18702329644526]
CoT Referring enhances model reasoning across modalities through a structured, chain-of-thought training data structure.<n>We restructure the training data to enforce a new output form, providing new annotations for existing datasets.<n>We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance.
arXiv Detail & Related papers (2025-10-03T08:50:21Z)
Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.<n>We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
Multimodal Representation Learning Conditioned on Semantic Relations [10.999120598129126]
Multimodal representation learning has advanced rapidly with contrastive models such as CLIP.<n>We propose Relation-Conditioned Multimodal Learning RCML, a framework that learns multimodal representations under natural-language relation descriptions.<n>Our approach constructs many-to-many training pairs linked by semantic relations and introduces a relation-guided cross-attention mechanism.
arXiv Detail & Related papers (2025-08-24T19:36:18Z)
ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension [29.50623143244436]
ReMeREC aims to localize specified entities or regions in an image based on natural language descriptions.<n>We first construct a relation-aware, multi-entity REC dataset called ReMeX.<n>We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities.
arXiv Detail & Related papers (2025-07-22T11:23:48Z)
Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension [31.952192907460713]
Relation-R1 is the textitfirst unified relation comprehension framework.<n>It integrates cognitive chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and group relative policy optimization ( GRPO)<n>Experiments on widely-used PSG and SWiG datasets demonstrate that Relation-R1 achieves state-of-the-art performance in both binary and textitN-ary relation understanding.
arXiv Detail & Related papers (2025-04-20T14:50:49Z)
Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose Stextsuperscript2RM to achieve high-quality cross-modality fusion. It follows a working strategy of trilogy: distributing language feature, spatial semantic recurrent coparsing, and parsed-semantic balancing. Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking [31.15972952813689]
We propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks. DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2023-10-09T10:21:42Z)
Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations. These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings. We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z)
Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image. We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
Modelling Multi-relations for Convolutional-based Knowledge Graph Embedding [0.2752817022620644]
It is considered that such approaches disconnect the semantic connection of multi-relations between an entity pair. We propose a convolutional and multi-relational learning model, ConvMR. We show that ConvMR is efficient to deal with less frequent entities.
arXiv Detail & Related papers (2022-10-21T03:43:06Z)
Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation [70.58243648754507]
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR) Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously. Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic similarity and transfer tasks.
arXiv Detail & Related papers (2022-10-18T11:37:36Z)
Transformer-based Dual Relation Graph for Multi-label Image Recognition [56.12543717723385]
We propose a novel Transformer-based Dual Relation learning framework. We explore two aspects of correlation, i.e., structural relation graph and semantic relation graph. Our approach achieves new state-of-the-art on two popular multi-label recognition benchmarks.
arXiv Detail & Related papers (2021-10-10T07:14:52Z)
Multiplex Word Embeddings for Selectional Preference Acquisition [70.33531759861111]
We propose a multiplex word embedding model, which can be easily extended according to various relations among words. Our model can effectively distinguish words with respect to different relations without introducing unnecessary sparseness.
arXiv Detail & Related papers (2020-01-09T04:47:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.