Generative Multimodal Entity Linking
- URL: http://arxiv.org/abs/2306.12725v4
- Date: Wed, 20 Mar 2024 01:30:41 GMT
- Title: Generative Multimodal Entity Linking
- Authors: Senbao Shi, Zhenran Xu, Baotian Hu, Min Zhang,
- Abstract summary: Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to referent entities from a knowledge base.
Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters.
We propose GEMEL, a Generative Multimodal Entity Linking framework based on Large Language Models (LLMs)
Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution.
- Score: 24.322540112710918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to the referent entities from a knowledge base. Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters, which can be prohibitively costly and difficult to scale in the era of Large Language Models (LLMs). In this work, we propose GEMEL, a Generative Multimodal Entity Linking framework based on LLMs, which directly generates target entity names. We keep the vision and language model frozen and only train a feature mapper to enable cross-modality interactions. To adapt LLMs to the MEL task, we leverage the in-context learning capability of LLMs by retrieving multimodal instances as demonstrations. Extensive experiments show that, with only ~0.3% of the model parameters fine-tuned, GEMEL achieves state-of-the-art results on two well-established MEL datasets (7.7% accuracy gains on WikiDiverse and 8.8% accuracy gains on WikiMEL). The performance gain stems from mitigating the popularity bias of LLM predictions and disambiguating less common entities effectively. Further analysis verifies the generality and scalability of GEMEL. Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution for utilizing LLMs in the MEL task. Our code is available at https://github.com/HITsz-TMG/GEMEL.
Related papers
- EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model [14.767055057048855]
We introduce the Data-Efficient and Compute-Efficient Multimodal Large Language Model (EE-MLLM)
EE-MLLM achieves both data and compute efficiency without introducing additional modules or learnable parameters.
Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks.
arXiv Detail & Related papers (2024-08-21T17:36:37Z) - Extend Model Merging from Fine-Tuned to Pre-Trained Large Language Models via Weight Disentanglement [72.97553348776425]
We make a pioneering effort to broaden the applicability of merging techniques from FT to PT LLMs.
We introduce an approach based on WeIght DisENtanglement (WIDEN) to effectively extend the merging scope.
We merge Qwen1.5-Chat (an FT LLM with instruction-following skills) with Sailor (a PT LLM with multilingual abilities) across 7B and 14B model scales.
arXiv Detail & Related papers (2024-08-06T10:46:46Z) - UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entities Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Sub-goal Distillation: A Method to Improve Small Language Agents [21.815417165548187]
Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks.
We propose a method for transferring the performance of an LLM with billions of parameters to a much smaller language model.
In ScienceWorld, a challenging and multi-task interactive text environment, our method surpasses standard imitation learning based solely on elementary actions by 16.7%.
arXiv Detail & Related papers (2024-05-04T20:34:06Z) - Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs)
We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z) - Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - Recommender AI Agent: Integrating Large Language Models for Interactive
Recommendations [53.76682562935373]
We introduce an efficient framework called textbfInteRecAgent, which employs LLMs as the brain and recommender models as tools.
InteRecAgent achieves satisfying performance as a conversational recommender system, outperforming general-purpose LLMs.
arXiv Detail & Related papers (2023-08-31T07:36:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.