A Multimodal Transformer for Live Streaming Highlight Prediction
- URL: http://arxiv.org/abs/2407.12002v1
- Date: Sat, 15 Jun 2024 04:59:19 GMT
- Title: A Multimodal Transformer for Live Streaming Highlight Prediction
- Authors: Jiaxin Deng, Shiyao Wang, Dong Shen, Liqin Zhao, Fan Yang, Guorui Zhou, Gaofeng Meng
- Abstract summary: Live streaming requires models to infer without future frames and process complex multimodal interactions.
We introduce a novel Modality Temporal Alignment Module to handle the temporal shift of cross-modal signals.
We propose a novel Border-aware Pairwise Loss to learn from a large-scale dataset and utilize user implicit feedback as a weak supervision signal.
- Score: 26.787089919015983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, live streaming platforms have gained immense popularity. Traditional video highlight detection mainly focuses on visual features and utilizes both past and future content for prediction. However, live streaming requires models to infer without future frames and to process complex multimodal interactions, including images, audio, and text comments. To address these issues, we propose a multimodal transformer that incorporates historical look-back windows. We introduce a novel Modality Temporal Alignment Module to handle the temporal shift of cross-modal signals. Additionally, existing datasets with limited manual annotations are insufficient for live streaming, whose topics are constantly updated and changed. Therefore, we propose a novel Border-aware Pairwise Loss to learn from a large-scale dataset, using user implicit feedback as a weak supervision signal. Extensive experiments show that our model outperforms various strong baselines on both real-world scenarios and public datasets. We will release our dataset and code to support better evaluation of this topic.
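The abstract does not give the exact form of the Border-aware Pairwise Loss, so the snippet below is only a minimal PyTorch sketch of one plausible reading: frames in the look-back window are ranked pairwise according to a weakly supervised signal derived from user implicit feedback, and pairs involving frames near segment borders are down-weighted. All names (`border_aware_pairwise_loss`, `border_weight`, the margin value) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def border_aware_pairwise_loss(scores, feedback, border_weight, margin=0.3):
    """
    Hypothetical sketch of a border-aware pairwise ranking loss.

    scores:        (T,) predicted highlight scores for T look-back frames
    feedback:      (T,) weak labels derived from user implicit feedback
                   (e.g., normalized engagement per frame)
    border_weight: (T,) down-weighting factor for frames near highlight
                   borders, where implicit feedback is assumed to be noisy
    margin:        ranking margin (illustrative value)
    """
    # Ordered frame pairs (i, j) where frame i received more implicit
    # feedback than frame j.
    diff_fb = feedback.unsqueeze(1) - feedback.unsqueeze(0)   # (T, T)
    pos_pairs = diff_fb > 0

    # Pairwise margin ranking: the higher-feedback frame should score higher.
    diff_scores = scores.unsqueeze(1) - scores.unsqueeze(0)   # (T, T)
    pair_loss = F.relu(margin - diff_scores)

    # Down-weight pairs that touch frames near segment borders.
    pair_weight = border_weight.unsqueeze(1) * border_weight.unsqueeze(0)

    weighted = pair_loss * pair_weight * pos_pairs.float()
    return weighted.sum() / pos_pairs.float().sum().clamp(min=1.0)
```

The intuition behind the border weighting, under this reading, is that implicit feedback (likes, comments) tends to lag the actual highlight moment, so pairs that straddle a highlight border are the least reliable supervision and should contribute less to the loss.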
Related papers
- MMBee: Live Streaming Gift-Sending Recommendations via Multi-Modal Fusion and Behaviour Expansion [18.499672566131355]
Accurately modeling the gifting interaction not only enhances users' experience but also increases streamers' revenue.
Previous studies on live streaming gifting prediction treat this task as a conventional recommendation problem.
We propose MMBee based on real-time Multi-Modal Fusion and Behaviour Expansion to address these issues.
arXiv Detail & Related papers (2024-06-15T04:59:00Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction with Multimodal Transformer [31.10377461705053]
We propose a ContentCTR model that leverages multimodal transformer for frame-level CTR prediction.
We conduct extensive experiments on both real-world scenarios and public datasets, and our ContentCTR model outperforms traditional recommendation models in capturing real-time content changes.
arXiv Detail & Related papers (2023-06-26T03:04:53Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction [42.563865078323204]
We present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks.
We show that our proposed model achieves state-of-the-art performance on the Argoverse Motion Forecasting Competition and Open Motion Prediction Challenge.
arXiv Detail & Related papers (2021-11-29T21:36:53Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
- Graph2Kernel Grid-LSTM: A Multi-Cued Model for Pedestrian Trajectory Prediction by Learning Adaptive Neighborhoods [10.57164270098353]
We present a new perspective on interaction modeling by proposing that pedestrian neighborhoods can be adaptive in design.
Our model outperforms state-of-the-art approaches that collate similar features, across several publicly tested surveillance videos.
arXiv Detail & Related papers (2020-07-03T19:05:48Z)
- Multimodal Matching Transformer for Live Commenting [97.06576354830736]
Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
arXiv Detail & Related papers (2020-02-07T07:19:15Z)