ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction
with Multimodal Transformer
- URL: http://arxiv.org/abs/2306.14392v1
- Date: Mon, 26 Jun 2023 03:04:53 GMT
- Title: ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction
with Multimodal Transformer
- Authors: Jiaxin Deng, Dong Shen, Shiyao Wang, Xiangyu Wu, Fan Yang, Guorui
Zhou, Gaofeng Meng
- Abstract summary: We propose a ContentCTR model that leverages a multimodal transformer for frame-level CTR prediction.
We conduct extensive experiments on both real-world scenarios and public datasets, and our ContentCTR model outperforms traditional recommendation models in capturing real-time content changes.
- Score: 31.10377461705053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, live streaming platforms have gained immense popularity as
they allow users to broadcast their videos and interact in real-time with hosts
and peers. Due to the dynamic changes of live content, accurate recommendation
models are crucial for enhancing user experience. However, most previous works
treat the live stream as a whole item and explore the click-through rate (CTR)
prediction framework at the item level, neglecting the dynamic changes that
occur even within the same live room. In this paper, we propose a ContentCTR
model that leverages a multimodal transformer for frame-level CTR prediction.
First, we present an end-to-end framework that can make full use of multimodal
information, including visual frames, audio, and comments, to identify the most
attractive live frames. Second, to prevent the model from collapsing into a
mediocre solution, a novel pairwise loss function with first-order difference
constraints is proposed to exploit the contrastive information between
highlight and non-highlight frames. Additionally, we design a temporal
text-video alignment module based on Dynamic Time Warping to eliminate noise
caused by the ambiguity and non-sequential alignment of visual and textual
information. We conduct extensive experiments on both real-world scenarios and
public datasets, and our ContentCTR model outperforms traditional
recommendation models in capturing real-time content changes. Moreover, we
deploy the proposed method on our company platform, and the results of online
A/B testing further validate its practical significance.
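The abstract describes the temporal text-video alignment module only at a high level. The snippet below is a minimal Python/NumPy sketch, assuming comment and frame embeddings have already been extracted, of how classic Dynamic Time Warping could produce a monotonic alignment between the two sequences; the function name, cosine-distance cost, and array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dtw_align(text_emb: np.ndarray, frame_emb: np.ndarray):
    """Align comment-window embeddings (T, d) to video-frame embeddings (F, d)
    with classic Dynamic Time Warping. Returns the alignment path as a list of
    (text_idx, frame_idx) pairs."""
    T, F = len(text_emb), len(frame_emb)

    def cost(i, j):
        # Cosine distance between a text window and a frame embedding.
        a, b = text_emb[i], frame_emb[j]
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Accumulated-cost matrix with the standard DTW recurrence.
    D = np.full((T + 1, F + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, F + 1):
            D[i, j] = cost(i - 1, j - 1) + min(D[i - 1, j],      # drop a text step
                                               D[i, j - 1],      # stretch the text window
                                               D[i - 1, j - 1])  # advance both
    # Backtrack to recover the monotonic alignment path.
    path, i, j = [], T, F
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

In such a sketch, the recovered path could gate which comment window is allowed to attend to which frame, suppressing mismatched text-frame pairs before multimodal fusion.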
Related papers
- A Multimodal Transformer for Live Streaming Highlight Prediction [26.787089919015983]
Live streaming requires models to infer without future frames and process complex multimodal interactions.
We introduce a novel Modality Temporal Alignment Module to handle the temporal shift of cross-modal signals.
We propose a novel Border-aware Pairwise Loss to learn from a large-scale dataset and utilize user implicit feedback as a weak supervision signal.
arXiv Detail & Related papers (2024-06-15T04:59:19Z)
- TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation [97.96178992465511]
We argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses.
To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics.
arXiv Detail & Related papers (2024-06-12T21:41:32Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for
Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- SimOn: A Simple Framework for Online Temporal Action Localization [51.27476730635852]
We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
arXiv Detail & Related papers (2022-11-08T04:50:54Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Online Video Instance Segmentation via Robust Context Fusion [36.376900904288966]
Video instance segmentation (VIS) aims at classifying, segmenting and tracking object instances in video sequences.
Recent transformer-based neural networks have demonstrated their powerful capability of modeling for the VIS task.
We propose a robust context fusion network to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames.
arXiv Detail & Related papers (2022-07-12T15:04:50Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but steadily improves localization performance compared to conventional weakly supervised methods.
In this paper, by revealing that the performance bottleneck of existing approaches mainly comes from background errors, we find that a stronger action localizer can be trained with labels on background video frames rather than on action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Leveraging Local Temporal Information for Multimodal Scene
Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
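The Long-Short Temporal Contrastive Learning entry above only states that a short clip's representation should predict the context of a longer clip from the same video. As a hedged illustration, here is a generic InfoNCE-style loss in PyTorch that matches each short-clip embedding to its long-clip counterpart within a batch; the paper's actual objective may use a different predictor or loss formulation, so treat this purely as a sketch.

```python
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(short_feat: torch.Tensor,
                                long_feat: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: the pooled embedding of a short clip (B, d) should
    match the pooled embedding of the longer clip it was sampled from (B, d),
    with other videos in the batch acting as negatives."""
    short_feat = F.normalize(short_feat, dim=-1)
    long_feat = F.normalize(long_feat, dim=-1)
    logits = short_feat @ long_feat.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(short_feat.size(0), device=short_feat.device)
    return F.cross_entropy(logits, targets)
```

In this formulation the diagonal of the similarity matrix holds the positive pairs, every other video in the batch serves as a negative, and the temperature controls how sharply mismatches are penalized.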