Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition
- URL: http://arxiv.org/abs/2407.21033v3
- Date: Sat, 25 Jan 2025 11:53:12 GMT
- Title: Multi-Grained Query-Guided Set Prediction Network for Grounded Multimodal Named Entity Recognition
- Authors: Jielong Tang, Zhenxing Wang, Ziyang Gong, Jianxing Yu, Xiangwei Zhu, Jian Yin
- Abstract summary: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. We propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels.
- Score: 9.506482334842293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounded Multimodal Named Entity Recognition (GMNER) is an emerging information extraction (IE) task that aims to simultaneously extract entity spans, types, and the corresponding visual regions of entities from given sentence-image pairs. Recent unified methods employing machine reading comprehension or sequence generation-based frameworks show limitations in this difficult task. The former, which rely on human-designed type queries, struggle to differentiate ambiguous entities, such as Jordan (Person) and off-White x Jordan (Shoes). The latter, which follow a one-by-one decoding order, suffer from exposure bias. We maintain that these works misunderstand the relationships of multimodal entities. To tackle these issues, we propose a novel unified framework named Multi-grained Query-guided Set Prediction Network (MQSPN) to learn appropriate relationships at intra-entity and inter-entity levels. Specifically, MQSPN explicitly aligns textual entities with visual regions by employing a set of learnable queries to strengthen intra-entity connections. Based on distinct intra-entity modeling, MQSPN reformulates GMNER as a set prediction problem, guiding models to establish appropriate inter-entity relationships from an optimal global matching perspective. Additionally, we incorporate a query-guided Fusion Net (QFNet) as a glue network to improve the alignment of the two-level relationships. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on widely used benchmarks.
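
The "set prediction" and "optimal global matching" ideas in the abstract can be illustrated with a small sketch. The abstract does not give MQSPN's cost function or matching procedure, so everything here is an assumption made for illustration: the `match_cost` function, the dictionary fields `span`, `type` and `type_probs`, and the toy data are all hypothetical. For brevity the sketch searches assignments exhaustively; set prediction models more commonly use the Hungarian algorithm for this step, and whether MQSPN does so is not stated here.

```python
from itertools import permutations


def match_cost(pred, gold):
    """Hypothetical pairwise cost (not the paper's): type mismatch plus span mismatch."""
    type_cost = 1.0 - pred["type_probs"].get(gold["type"], 0.0)
    span_cost = 0.0 if pred["span"] == gold["span"] else 1.0
    return type_cost + span_cost


def optimal_matching(preds, golds):
    """Find the one-to-one assignment of gold entities to predictions with minimal total cost.

    Brute force over all assignments, so only suitable for tiny illustrative inputs.
    Returns (pairs, total_cost), where pairs is a list of (pred_index, gold_index).
    """
    if len(preds) < len(golds):
        raise ValueError("need at least as many predictions (queries) as gold entities")
    best_pairs, best_cost = [], float("inf")
    for perm in permutations(range(len(preds)), len(golds)):
        # perm[g] is the prediction index assigned to gold entity g
        cost = sum(match_cost(preds[p], golds[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs, best_cost


# Toy data (hypothetical): three query predictions, two gold entities.
preds = [
    {"span": (3, 3), "type_probs": {"Shoes": 0.2, "Person": 0.7}},
    {"span": (0, 2), "type_probs": {"Shoes": 0.8, "Person": 0.1}},
    {"span": (5, 5), "type_probs": {"Person": 0.3}},
]
golds = [
    {"span": (0, 2), "type": "Shoes"},   # e.g. "off-White x Jordan"
    {"span": (3, 3), "type": "Person"},  # e.g. "Jordan"
]

pairs, cost = optimal_matching(preds, golds)
print(pairs, round(cost, 3))
```

Because the whole set of predictions is matched to the whole set of gold entities at once, the order in which entities are predicted does not matter, which is the property the abstract contrasts with one-by-one decoding.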
 
      
Related papers
- ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension [29.50623143244436]
 ReMeREC aims to localize specified entities or regions in an image based on natural language descriptions.<n>We first construct a relation-aware, multi-entity REC dataset called ReMeX.<n>We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities.
 arXiv  Detail & Related papers  (2025-07-22T11:23:48Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
 Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
 Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
 arXiv  Detail & Related papers  (2024-10-31T06:55:24Z)
- OneNet: A Fine-Tuning Free Framework for Few-Shot Entity Linking via Large Language Model Prompting [49.655711022673046]
 OneNet is an innovative framework that utilizes the few-shot learning capabilities of Large Language Models (LLMs) without the need for fine-tuning.
OneNet is structured around three key components prompted by LLMs: (1) an entity reduction processor that simplifies inputs by summarizing and filtering out irrelevant entities, (2) a dual-perspective entity linker that combines contextual cues and prior knowledge for precise entity linking, and (3) an entity consensus judger that employs a unique consistency algorithm to alleviate the hallucination in the entity linking reasoning.
 arXiv  Detail & Related papers  (2024-10-10T02:45:23Z)
- IBMEA: Exploring Variational Information Bottleneck for Multi-modal Entity Alignment [17.570243718626994]
 Multi-modal entity alignment (MMEA) aims to identify equivalent entities between multi-modal knowledge graphs (MMKGs).
We devise multi-modal variational encoders to generate modal-specific entity representations as probability distributions.
We also propose four modal-specific information bottleneck regularizers, limiting the misleading clues in refining modal-specific entity representations.
 arXiv  Detail & Related papers  (2024-07-27T17:12:37Z)
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
 We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
 arXiv  Detail & Related papers  (2024-06-25T12:47:04Z)
- LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition [28.136662420053568]
 Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions.
We propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge.
 arXiv  Detail & Related papers  (2024-02-15T14:54:33Z)
- DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking [31.15972952813689]
 We propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks.
DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples.
 Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
 arXiv  Detail & Related papers  (2023-10-09T10:21:42Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
 M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
 arXiv  Detail & Related papers  (2023-08-06T09:15:14Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
 Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
 arXiv  Detail & Related papers  (2023-07-19T02:11:19Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
 Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
 arXiv  Detail & Related papers  (2023-06-19T15:31:34Z)
- Attribute-Consistent Knowledge Graph Representation Learning for Multi-Modal Entity Alignment [14.658282035561792]
 We propose a novel attribute-consistent knowledge graph representation learning framework for MMEA (ACK-MMEA).
Our approach achieves excellent performance compared to its competitors.
 arXiv  Detail & Related papers  (2023-04-04T06:39:36Z)
- Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation [33.56304858796142]
 Multi-modal multi-hop question answering involves answering a question by reasoning over multiple input sources from different modalities.
Existing methods often retrieve evidence separately and then use a language model to generate an answer based on the retrieved evidence.
We propose a Structured Knowledge and Unified Retrieval-Generation (RG) approach to address these issues.
 arXiv  Detail & Related papers  (2022-12-16T18:12:04Z)
- MNER-QG: An End-to-End MRC framework for Multimodal Named Entity Recognition with Query Grounding [21.49274082010887]
 Multimodal named entity recognition (MNER) is a critical step in information extraction.
We propose a novel end-to-end framework named MNER-QG that can simultaneously perform MRC-based named entity recognition and query grounding.
 arXiv  Detail & Related papers  (2022-11-27T06:10:03Z)
- Multi-modal Contrastive Representation Learning for Entity Alignment [57.92705405276161]
 Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs.
We propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model.
In particular, MCLEA firstly learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions.
 arXiv  Detail & Related papers  (2022-09-02T08:59:57Z)
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
 We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive forecasting decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
 arXiv  Detail & Related papers  (2022-05-07T02:10:55Z)
- CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
 Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
 arXiv  Detail & Related papers  (2020-11-10T04:28:11Z)
- Adaptive Attentional Network for Few-Shot Knowledge Graph Completion [16.722373937828117]
 Few-shot Knowledge Graph (KG) completion is a focus of current research, where each task aims at querying unseen facts of a relation given its few-shot reference entity pairs.
Recent attempts solve this problem by learning static representations of entities and references, ignoring their dynamic properties.
This work proposes an adaptive attentional network for few-shot KG completion by learning adaptive entity and reference representations.
 arXiv  Detail & Related papers  (2020-10-19T16:27:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     