PerSRV: Personalized Sticker Retrieval with Vision-Language Model
- URL: http://arxiv.org/abs/2410.21801v1
- Date: Tue, 29 Oct 2024 07:13:47 GMT
- Title: PerSRV: Personalized Sticker Retrieval with Vision-Language Model
- Authors: Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang
- Abstract summary: We propose the Personalized Sticker Retrieval with Vision-Language Model framework, namely PerSRV, structured into offline calculations and online processing modules.
For sticker-level semantic understanding, we apply supervised fine-tuning to LLaVA-1.5-7B to generate human-like sticker semantics.
Thirdly, we cluster style centroids based on users' historical interactions to achieve personal preference modeling.
- Score: 21.279568613306573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instant messaging is a popular means of daily communication, allowing users to send text and stickers. As the saying goes, "a picture is worth a thousand words", so developing an effective sticker retrieval technique is crucial for enhancing the user experience. However, existing sticker retrieval methods rely on labeled data to interpret stickers, and general-purpose Vision-Language Models (VLMs) often struggle to capture the unique semantics of stickers. Additionally, relevance-based sticker retrieval methods lack personalization, creating a gap between diverse user expectations and retrieval results. To address these issues, we propose the Personalized Sticker Retrieval with Vision-Language Model framework, namely PerSRV, structured into offline calculation and online processing modules. The online retrieval part follows the paradigm of relevance recall and personalized ranking, supported by offline pre-calculation modules for sticker semantic understanding, utility evaluation and personalization. Firstly, for sticker-level semantic understanding, we apply supervised fine-tuning to LLaVA-1.5-7B to generate human-like sticker semantics, complemented by textual content extracted from the figures and from historical interaction queries. Secondly, we investigate three crowd-sourcing metrics for sticker utility evaluation. Thirdly, we cluster style centroids based on users' historical interactions to achieve personal preference modeling. Finally, we evaluate the proposed PerSRV method on a public sticker retrieval dataset from WeChat, containing 543,098 candidates and 12,568 interactions. Experimental results show that PerSRV significantly outperforms existing methods in multi-modal sticker retrieval. Additionally, our fine-tuned VLM delivers notable improvements in sticker semantic understanding.
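The abstract describes an offline stage (semantic captioning with the fine-tuned VLM, utility scoring, and per-user style clustering) feeding an online stage of relevance recall followed by personalized ranking. The snippet below is a minimal, hypothetical sketch of how such a recall-then-rerank flow could be wired together; the function names, the k-means clustering, the cosine-similarity scoring and the mixing weight are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a recall-then-rerank sticker retrieval flow.
# Assumes precomputed embeddings for stickers, queries, and user history.
import numpy as np
from sklearn.cluster import KMeans


def build_style_centroids(user_history_embs: np.ndarray, n_styles: int = 3) -> np.ndarray:
    """Offline: cluster a user's historically clicked sticker embeddings
    into style centroids for preference modeling (assumed k-means)."""
    n_clusters = min(n_styles, len(user_history_embs))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(user_history_embs)
    return kmeans.cluster_centers_


def _cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity matrix between a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def retrieve(query_emb: np.ndarray,
             sticker_embs: np.ndarray,
             centroids: np.ndarray,
             top_k: int = 10,
             recall_size: int = 100,
             alpha: float = 0.7) -> np.ndarray:
    """Online: relevance recall over all candidates, then re-rank the
    recalled set by a weighted mix of relevance and personal style affinity."""
    relevance = _cosine(query_emb[None, :], sticker_embs)[0]   # (N,)
    recalled = np.argsort(-relevance)[:recall_size]            # coarse relevance recall

    # Personal preference: max similarity to any of the user's style centroids.
    preference = _cosine(sticker_embs[recalled], centroids).max(axis=1)

    final = alpha * relevance[recalled] + (1 - alpha) * preference
    return recalled[np.argsort(-final)[:top_k]]
```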
Related papers
- Training-Free Personalization via Retrieval and Reasoning on Fingerprints [31.025439143093585]
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts.
We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs.
R2P consistently outperforms state-of-the-art approaches on various downstream tasks.
arXiv Detail & Related papers (2025-03-24T12:36:24Z)
- Impact of Stickers on Multimodal Chat Sentiment Analysis and Intent Recognition: A New Task, Dataset and Baseline [4.375392069380812]
We propose a new task: Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers (MSAIRS).
We introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms.
Our dataset and code will be publicly available.
arXiv Detail & Related papers (2024-05-14T08:42:49Z)
- HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model [62.995175485416]
We propose a new approach to enrich the semantic representation of HuBERT.
An auxiliary topic classification task is added to HuBERT by using topic labels as teachers.
Experimental results demonstrate that our method achieves comparable or better performance than the baseline in most tasks.
arXiv Detail & Related papers (2023-10-06T02:19:09Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Sticker820K: Empowering Interactive Retrieval with Stickers [34.67442172774095]
We propose a large-scale Chinese sticker dataset, namely Sticker820K, which consists of 820k image-text pairs.
Each sticker has rich and high-quality textual annotations, including descriptions, optical characters, emotional labels, and style classifications.
For the text-to-image retrieval task, our StickerCLIP demonstrates strong superiority over CLIP, achieving an absolute gain of 66.0% in mean recall.
arXiv Detail & Related papers (2023-06-12T05:06:53Z)
- Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification [5.960550152906609]
We capture hinting features from user comments, which are retrieved via jointly leveraging visual and lingual similarity.
The classification tasks are explored via self-training in a teacher-student framework, motivated by the typically limited scale of labeled data.
The results show that our method further advances the performance of previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-27T08:59:55Z)
- Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
- Selecting Stickers in Open-Domain Dialogue through Multitask Learning [51.67855506570727]
We propose a multitask learning method comprising three auxiliary tasks to enhance the understanding of dialogue history, emotion and the semantic meaning of stickers.
Our model can better combine the multimodal information and achieve significantly higher accuracy over strong baselines.
arXiv Detail & Related papers (2022-09-16T03:45:22Z)
- Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog [67.91114640314004]
Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps.
Some works are dedicated to automatically selecting a sticker response by matching the sticker image with previous utterances.
We propose to recommend an appropriate sticker to the user based on the multi-turn dialog context and the user's sticker usage history.
arXiv Detail & Related papers (2020-11-05T03:31:17Z)
- Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog [65.7021675527543]
Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps.
Some works are dedicated to automatically selecting a sticker response by matching the text labels of stickers with previous utterances.
We propose to recommend an appropriate sticker to the user based on the multi-turn dialog context history, without any external labels.
arXiv Detail & Related papers (2020-03-10T13:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.