FITRep: Attention-Guided Item Representation via MLLMs
- URL: http://arxiv.org/abs/2511.21389v1
- Date: Wed, 26 Nov 2025 13:38:19 GMT
- Title: FITRep: Attention-Guided Item Representation via MLLMs
- Authors: Guoxiao Zhang, Ao Li, Tan Qu, Qianlong Xie, Xingxing Wang,
- Abstract summary: We propose FITRep, the first attention-guided, white-box item representation framework for fine-grained item deduplication. Deployed on Meituan's advertising system, FITRep achieves +3.60% CTR and +4.25% CPM gains in online A/B tests, demonstrating both effectiveness and real-world impact.
- Score: 8.026404756145485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online platforms usually suffer from user experience degradation due to near-duplicate items with similar visuals and text. While Multimodal Large Language Models (MLLMs) enable multimodal embedding, existing methods treat representations as black boxes, ignoring structural relationships (e.g., primary vs. auxiliary elements), leading to the local structural collapse problem. To address this, inspired by Feature Integration Theory (FIT), we propose FITRep, the first attention-guided, white-box item representation framework for fine-grained item deduplication. FITRep consists of: (1) Concept Hierarchical Information Extraction (CHIE), using MLLMs to extract hierarchical semantic concepts; (2) Structure-Preserving Dimensionality Reduction (SPDR), an adaptive UMAP-based method for efficient information compression; and (3) FAISS-Based Clustering (FBC), which assigns each item a unique cluster id via FAISS-based clustering. Deployed on Meituan's advertising system, FITRep achieves +3.60% CTR and +4.25% CPM gains in online A/B tests, demonstrating both effectiveness and real-world impact.
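The abstract above describes a three-stage pipeline (MLLM concept extraction, UMAP-based compression, FAISS-based clustering). The following is a minimal sketch of that pipeline shape, not the paper's implementation: the MLLM concept-extraction step is replaced by stand-in random embeddings, and all dimensions, cluster counts, and names are illustrative assumptions.

```python
# Minimal sketch of a FITRep-style pipeline: (1) concept embeddings per item
# (the paper uses an MLLM; here a random stand-in), (2) UMAP-based reduction,
# (3) FAISS k-means assigning each item a cluster id usable as a dedup key.
import numpy as np
import umap    # pip install umap-learn
import faiss   # pip install faiss-cpu

rng = np.random.default_rng(0)

# (1) Stand-in for CHIE: one concept embedding per item (the paper extracts
# hierarchical concepts such as primary vs. auxiliary elements with an MLLM).
n_items, raw_dim = 1000, 768
concept_embeddings = rng.normal(size=(n_items, raw_dim)).astype("float32")

# (2) Stand-in for SPDR: dimensionality reduction with UMAP.
reducer = umap.UMAP(n_components=32, metric="cosine", random_state=0)
compressed = reducer.fit_transform(concept_embeddings).astype("float32")

# (3) Stand-in for FBC: FAISS k-means; each item receives the id of its
# nearest centroid, so near-duplicate items share a cluster id.
n_clusters = 50
kmeans = faiss.Kmeans(compressed.shape[1], n_clusters, niter=20, seed=0)
kmeans.train(compressed)
_, cluster_ids = kmeans.index.search(compressed, 1)
print(cluster_ids[:10].ravel())  # cluster id per item
```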
Related papers
- DMESR: Dual-view MLLM-based Enhancing Framework for Multimodal Sequential Recommendation [13.114773060703891]
We propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained original textual semantics.
arXiv Detail & Related papers (2026-02-14T10:42:56Z) - Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation [1.0839192829439435]
Hi-SAM is a Hierarchical Structure-Aware Multi-modal framework with two designs. It unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Deployed on a large-scale social platform, Hi-SAM achieved a 6.55% gain in the core online metric.
arXiv Detail & Related papers (2026-02-12T10:26:15Z) - Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification [0.2799896314754614]
We introduce a method for efficient multi-label text classification with large language models (LLMs). Instead of generating all labels in a single structured response, each target dimension is queried independently. Our findings suggest that decomposing multi-label classification into dichotomic queries offers a scalable and effective framework.
arXiv Detail & Related papers (2025-11-05T19:53:51Z) - TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding [52.59372043981724]
TableDART is a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. In addition, we propose a novel agent for cross-modal knowledge integration that analyzes outputs from text- and image-based models.
arXiv Detail & Related papers (2025-09-18T07:00:13Z) - Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations. We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z) - CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP [57.49519639951552]
We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations. Experiments on the CIFAR-100 and Flickr30K datasets demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples.
arXiv Detail & Related papers (2024-10-30T17:51:31Z) - Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples. We introduce a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. We propose a simple yet effective Test-time Adaptive Cross-modal (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z) - Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition [12.382193259575805]
We propose a novel multi-modality co-learning (MMCL) framework for efficient skeleton-based action recognition.
Our MMCL framework engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference.
arXiv Detail & Related papers (2024-07-22T15:16:47Z) - SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z) - Federated Unsupervised Representation Learning [56.715917111878106]
We formulate a new problem in federated learning called Federated Unsupervised Representation Learning (FURL) to learn a common representation model without supervision.
FedCA is composed of two key modules: a dictionary module that aggregates sample representations from each client and shares them with all clients to keep the representation space consistent, and an alignment module that aligns each client's representations with a base model trained on public data.
arXiv Detail & Related papers (2020-10-18T13:28:30Z)
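The FedCA summary above names two components: a representation dictionary pooled across clients and an alignment against a base model trained on public data. Below is a rough sketch of that structure only; every name, shape, and pooling rule is an assumption for illustration rather than the paper's method.

```python
# Rough sketch of the FedCA structure summarized above: clients contribute
# sample representations to a shared dictionary, and each client measures an
# alignment gap against a base model trained on public data. All names,
# shapes, and the pooling rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 64, 4

# Stand-in encoder outputs: each client's representations of its local samples.
client_reps = [rng.normal(size=(32, dim)) for _ in range(n_clients)]

# Dictionary module (sketch): pool a few representations from every client and
# share the pooled set back, giving all clients a common view of the space.
shared_dictionary = np.concatenate([reps[:8] for reps in client_reps], axis=0)

# Alignment module (sketch): compare each client's encoding of a public set
# with a base model's encoding of the same set; this gap would drive a loss.
base_public_reps = rng.normal(size=(16, dim))  # base model on public data
for i, reps in enumerate(client_reps):
    client_public_reps = reps[:16]             # stand-in for encoding the public set
    gap = float(np.mean(np.linalg.norm(client_public_reps - base_public_reps, axis=1)))
    print(f"client {i}: dictionary size={len(shared_dictionary)}, alignment gap={gap:.2f}")
```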