HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
- URL: http://arxiv.org/abs/2502.05822v1
- Date: Sun, 09 Feb 2025 09:07:11 GMT
- Title: HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
- Authors: Guobing Gan, Kaiming Gao, Li Wang, Shen Jiang, Peng Jiang
- Abstract summary: This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems.
It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks.
The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue.
- Score: 10.61722566941537
- Abstract: Search advertising is essential for merchants to reach the target users on short video platforms. Short video ads aligned with user search intents are displayed through relevance matching and bid ranking mechanisms. This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. Recent vision-language pre-training models have demonstrated promise in various multimodal tasks. However, their contribution to downstream query-video relevance tasks is limited, as the alignment between the pair of visual signals and text differs from the modeling of the triplet of the query, visual signals, and video text. In addition, our previous relevance model provides limited ranking capabilities, largely due to the discrepancy between the binary cross-entropy fine-tuning objective and the ranking objective. To address these limitations, we design a high-consistency multimodal relevance model (HCMRM). It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks. Specifically, during the pre-training phase, along with aligning visual signals and video text, several keywords are extracted from the video text as pseudo-queries to perform the triplet relevance modeling. For the fine-tuning phase, we introduce a hierarchical softmax loss, which enables the model to learn the order within labels while maximizing the distinction between positive and negative samples. This promotes the fusion ranking of relevance and bidding in the subsequent ranking stage. The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue.
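The hierarchical softmax objective is concrete enough to sketch. Below is a minimal PyTorch illustration, not the deployed HCMRM code: it assumes three ordered relevance grades (irrelevant, partially relevant, relevant) with a single negative grade, and it sums a binary level that separates positives from negatives with a conditional softmax level that orders the positive grades. The paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def hierarchical_softmax_loss(logits: torch.Tensor,
                              labels: torch.Tensor,
                              num_negative_grades: int = 1) -> torch.Tensor:
    """Hedged sketch of a two-level loss over K ordered relevance grades.

    logits: (B, K) per-grade scores for each query-video pair,
            grades ordered from worst to best.
    labels: (B,) integer grade in [0, K); grades below
            `num_negative_grades` are treated as negative.
    The deployed HCMRM loss may differ in its exact form.
    """
    probs = F.softmax(logits, dim=-1)                      # (B, K)
    pos_mass = probs[:, num_negative_grades:].sum(dim=-1)  # P(pair is positive)
    is_pos = (labels >= num_negative_grades).float()

    # Level 1: push positive probability mass toward 1 for positives and
    # 0 for negatives, maximizing the positive/negative distinction.
    level1 = F.binary_cross_entropy(pos_mass.clamp(1e-6, 1 - 1e-6), is_pos)

    # Level 2: a softmax restricted to the positive grades, so positive
    # samples also learn the ordering within the positive labels.
    pos_rows = labels >= num_negative_grades
    if pos_rows.any():
        level2 = F.cross_entropy(
            logits[pos_rows][:, num_negative_grades:],
            labels[pos_rows] - num_negative_grades,
        )
    else:
        level2 = logits.new_zeros(())
    return level1 + level2

# Toy usage: B=4 pairs, K=3 grades {0: irrelevant, 1: partial, 2: relevant}.
logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([0, 2, 1, 2])
hierarchical_softmax_loss(logits, labels).backward()
```

At serving time, a scalar score such as the expected grade under the predicted distribution preserves the learned ordering, which is what the subsequent fusion ranking of relevance and bids relies on.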
Related papers
- Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression [25.657978409890973]
Action Quality Assessment (AQA) aims at automatic and fair evaluation of athletic performance.
Current methods focus on segmenting video into fixed frames, which disrupts the temporal continuity of sub-actions.
We propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression.
arXiv Detail & Related papers (2025-01-07T10:20:16Z) - Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - A Novel Energy based Model Mechanism for Multi-modal Aspect-Based Sentiment Analysis [85.77557381023617]
We propose a novel framework called DQPSA for multi-modal sentiment analysis.
The PDQ module uses the prompt as both a visual query and a language query to extract prompt-aware visual information.
The EPE module models the boundary pairing of the analysis target from the perspective of an Energy-based Model.
arXiv Detail & Related papers (2023-12-13T12:00:46Z) - CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z) - Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search [27.42717207107]
Cross-Modal sponsored search displays multi-modal advertisements (ads) when consumers look for desired products by natural language queries in search engines.
The ability to align ads-specific information in both images and texts is crucial for accurate and flexible sponsored search.
We propose a simple alignment network for explicitly mapping fine-grained visual parts in ads images to the corresponding text.
arXiv Detail & Related papers (2023-09-28T03:43:57Z) - Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization [14.223683006262151]
We propose a novel approach to boost multi-modal e-commerce attribute value extraction via unified learning scheme and dynamic range minimization.
Experiments on the popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over the other state-of-the-art techniques.
arXiv Detail & Related papers (2022-07-15T03:58:04Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes novel information in the frame at the current timestamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Instance-Level Relative Saliency Ranking with Graph Reasoning [126.09138829920627]
We present a novel unified model to segment salient instances and infer relative saliency rank order.
A novel loss function is also proposed to effectively train the saliency ranking branch.
Experimental results demonstrate that our proposed model is more effective than previous methods.
arXiv Detail & Related papers (2021-07-08T13:10:42Z) - DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability and temporal coherence.
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.