POI-Enhancer: An LLM-based Semantic Enhancement Framework for POI Representation Learning
- URL: http://arxiv.org/abs/2502.10038v2
- Date: Tue, 04 Mar 2025 00:19:42 GMT
- Title: POI-Enhancer: An LLM-based Semantic Enhancement Framework for POI Representation Learning
- Authors: Jiawei Cheng, Jingyuan Wang, Yichuan Zhang, Jiahao Ji, Yuanshao Zhu, Zhibo Zhang, Xiangyu Zhao,
- Abstract summary: Recent studies have shown that enriching POI representations with multimodal information can significantly enhance their task performance. Large language models (LLMs) trained on extensive text data have been found to possess rich textual knowledge. We propose POI-Enhancer, a portable framework that leverages LLMs to improve POI representations produced by classic POI learning models.
- Score: 34.93661259065691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: POI representation learning plays a crucial role in handling tasks related to user mobility data. Recent studies have shown that enriching POI representations with multimodal information can significantly enhance their task performance. Previously, the textual information incorporated into POI representations typically involved only POI categories or check-in content, leading to relatively weak textual features in existing methods. In contrast, large language models (LLMs) trained on extensive text data have been found to possess rich textual knowledge. However, leveraging such knowledge to enhance POI representation learning presents two key challenges: first, how to extract POI-related knowledge from LLMs effectively, and second, how to integrate the extracted information to enhance POI representations. To address these challenges, we propose POI-Enhancer, a portable framework that leverages LLMs to improve POI representations produced by classic POI learning models. We first design three specialized prompts to extract semantic information from LLMs efficiently. Then, the Dual Feature Alignment module enhances the quality of the extracted information, while the Semantic Feature Fusion module preserves its integrity. The Cross Attention Fusion module then adaptively integrates this high-quality information into POI representations, and Multi-View Contrastive Learning further injects human-understandable semantic information into these representations. Extensive experiments on three real-world datasets demonstrate the effectiveness of our framework, showing significant improvements across all baseline representations.
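The modules above are described only in prose; as a rough illustration of the two central operations, cross-attention fusion of LLM-derived semantic features into a base POI embedding and an InfoNCE-style contrastive objective, here is a minimal PyTorch sketch. All class names, dimensions, the number of prompts, and the exact loss form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: a frozen base model's POI embedding attends to
    LLM-derived semantic features (dimensions are illustrative)."""

    def __init__(self, poi_dim=128, text_dim=768, n_heads=4):
        super().__init__()
        # Project LLM text features into the POI embedding space.
        self.text_proj = nn.Linear(text_dim, poi_dim)
        self.attn = nn.MultiheadAttention(poi_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(poi_dim)

    def forward(self, poi_emb, text_feats):
        # poi_emb:    (batch, poi_dim), from the base POI learning model
        # text_feats: (batch, n_prompts, text_dim), one vector per prompt
        q = poi_emb.unsqueeze(1)            # query: the POI representation
        kv = self.text_proj(text_feats)     # keys/values: semantic features
        fused, _ = self.attn(q, kv, kv)
        # Residual connection preserves the original representation.
        return self.norm(poi_emb + fused.squeeze(1))

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE; one plausible form of a multi-view contrastive
    objective, not necessarily the paper's exact loss."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature        # pairwise cosine similarities
    labels = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

# Usage sketch with random stand-ins for real embeddings.
fusion = CrossAttentionFusion()
poi_emb = torch.randn(32, 128)              # base-model POI embeddings
text_feats = torch.randn(32, 3, 768)        # features from 3 prompts
enhanced = fusion(poi_emb, text_feats)
loss = info_nce(enhanced, fusion.text_proj(text_feats).mean(dim=1))
```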
Related papers
- On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey [39.840208834931076]
General-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations.
We provide a comprehensive overview of GPTE in the era of pretrained language models (PLMs).
We describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation.
arXiv Detail & Related papers (2025-07-28T12:52:24Z) - Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data.
FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning.
Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
arXiv Detail & Related papers (2025-07-24T11:28:53Z) - A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends [11.428017294202162]
Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information.
This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components.
arXiv Detail & Related papers (2025-07-14T02:10:31Z) - Evolution of ReID: From Early Methods to LLM Integration [13.214445400030922]
Person re-identification has evolved from handcrafted feature-based methods to deep learning approaches.
This survey traces that full evolution and offers one of the first comprehensive reviews of ReID approaches that leverage large language models.
A key contribution is the use of dynamic, identity-specific prompts generated by GPT-4o, which enhance the alignment between images and text.
arXiv Detail & Related papers (2025-06-16T02:03:46Z) - RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RoRA-VLM is a novel and robust retrieval-augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z) - EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model [16.20833396645551]
We propose dynamic entity extraction using ChatGPT, which extracts entities on the fly and enhances datasets.
We also propose DIM, a method to Dynamically Integrate Multimodal information with a knowledge base, employing the visual understanding capability of large language models (LLMs).
arXiv Detail & Related papers (2024-06-27T15:18:23Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm [19.06214756792692]
In-context learning of large language models (LLMs) has achieved remarkable success in the field of natural language processing.
Case studies reveal that the single-step chain-of-thought approach faces challenges such as attention diffusion and inadequate performance in complex tasks like text-to-SQL.
A workflow paradigm is proposed, aiming to enhance the attention and problem-solving scope of LLMs through decomposition.
arXiv Detail & Related papers (2024-02-16T13:24:05Z) - CLIP-based Synergistic Knowledge Transfer for Text-based Person Retrieval [66.93563107820687]
We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR).
To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed from text-to-image and image-to-text bidirectional prompts and coupling projections.
CSKT outperforms state-of-the-art approaches across three benchmark datasets while its trainable parameters account for merely 7.4% of the entire model.
arXiv Detail & Related papers (2023-09-18T05:38:49Z) - Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as general task solvers, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning by combining a reinforcement learning policy with an inverse dynamics prediction objective; a minimal sketch of such an objective appears after this list.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - M3PT: A Multi-Modal Model for POI Tagging [18.585818094015465]
We propose M3PT, a novel Multi-Modal Model for POI Tagging that achieves enhanced tagging by fusing textual and visual features.
We first devise a domain-adaptive image encoder (DIE) to obtain the image embeddings aligned to their gold tags' semantics.
In M3PT's text-image fusion module (TIF), the textual and visual representations are fully fused into the POIs' content embeddings for the subsequent matching.
arXiv Detail & Related papers (2023-06-16T05:46:27Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
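The ALP entry above mentions an inverse dynamics prediction objective alongside a reinforcement learning policy. As a minimal sketch of that objective alone (illustrative dimensions and a hypothetical discrete action space, not the authors' implementation): a small head predicts the action taken between two consecutive observations from their encoded representations, which forces the encoder to retain action-relevant information.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamicsHead(nn.Module):
    """Sketch: classify the action a_t from representations of o_t and
    o_{t+1}; dimensions and action count are illustrative."""

    def __init__(self, repr_dim=256, n_actions=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * repr_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, z_t, z_t1):
        # Concatenate consecutive representations and predict action logits.
        return self.mlp(torch.cat([z_t, z_t1], dim=-1))

# Backpropagating this loss through the encoder (not shown) is what makes
# the learned representation "action-aware".
head = InverseDynamicsHead()
z_t, z_t1 = torch.randn(8, 256), torch.randn(8, 256)
actions = torch.randint(0, 6, (8,))
loss = F.cross_entropy(head(z_t, z_t1), actions)
```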
This list is automatically generated from the titles and abstracts of the papers on this site.