A Multimodal Transformer for Live Streaming Highlight Prediction
- URL: http://arxiv.org/abs/2407.12002v1
- Date: Sat, 15 Jun 2024 04:59:19 GMT
- Title: A Multimodal Transformer for Live Streaming Highlight Prediction
- Authors: Jiaxin Deng, Shiyao Wang, Dong Shen, Liqin Zhao, Fan Yang, Guorui Zhou, Gaofeng Meng,
- Abstract summary: Live streaming requires models to infer without future frames and process complex multimodal interactions.
We introduce a novel Modality Temporal Alignment Module to handle the temporal shift of cross-modal signals.
We propose a novel Border-aware Pairwise Loss to learn from a large-scale dataset and utilize user implicit feedback as a weak supervision signal.
- Score: 26.787089919015983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Live streaming platforms have recently gained immense popularity. Traditional video highlight detection focuses mainly on visual features and uses both past and future content for prediction. Live streaming, however, requires models to infer without future frames and to process complex multimodal interactions among images, audio, and text comments. To address these issues, we propose a multimodal transformer that incorporates historical look-back windows. We introduce a novel Modality Temporal Alignment Module to handle the temporal shift of cross-modal signals. Additionally, existing datasets with limited manual annotations are insufficient for live streaming, whose topics are constantly updated and changed. We therefore propose a novel Border-aware Pairwise Loss that learns from a large-scale dataset and uses implicit user feedback as a weak supervision signal. Extensive experiments show that our model outperforms various strong baselines in both real-world scenarios and on public datasets. We will release our dataset and code to support further study of this topic.
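The abstract names two components, the Modality Temporal Alignment Module and the Border-aware Pairwise Loss, without giving their formulation. The PyTorch sketch below is one plausible reading, not the authors' implementation: cross-modal alignment is approximated by cross-attention restricted to a causal look-back window (so no future frames are consulted), and the pairwise loss down-weights frame pairs that fall close to a highlight border inferred from implicit feedback. All class names, signatures, and hyperparameters (`lookback`, `margin`, `tau`) are hypothetical.

```python
# Hypothetical sketch, not the released code: one way to read the two
# components described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityTemporalAlignment(nn.Module):
    """Align a lagging modality (e.g. text comments or audio) to the visual
    timeline: each visual step may only attend to the other modality within a
    causal look-back window, so no future frames are used."""

    def __init__(self, dim: int, num_heads: int = 4, lookback: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.lookback = lookback

    def forward(self, visual: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # visual, other: (batch, time, dim) on a shared time axis.
        t = visual.size(1)
        idx = torch.arange(t, device=visual.device)
        delta = idx[None, :] - idx[:, None]               # key index minus query index
        blocked = (delta > 0) | (delta < -self.lookback)  # True = may not attend
        aligned, _ = self.attn(visual, other, other, attn_mask=blocked)
        return self.norm(visual + aligned)


def border_aware_pairwise_loss(pos_scores, neg_scores, pos_dist, neg_dist,
                               margin: float = 0.3, tau: float = 2.0):
    """Pairwise ranking loss over (highlight, non-highlight) frame pairs whose
    weak labels come from implicit feedback. Pairs near a highlight border are
    down-weighted, since weak labels are least reliable there.

    pos_scores, neg_scores: model scores for each pair, shape (num_pairs,).
    pos_dist, neg_dist: distance (in frames) to the nearest highlight border.
    """
    reliability = 1.0 - torch.exp(-torch.minimum(pos_dist, neg_dist) / tau)
    hinge = F.relu(margin - (pos_scores - neg_scores))
    return (reliability * hinge).mean()


if __name__ == "__main__":
    # Toy shapes only: 2 streams, 16 time steps, 64-dim features.
    align = ModalityTemporalAlignment(dim=64)
    fused = align(torch.randn(2, 16, 64), torch.randn(2, 16, 64))
    loss = border_aware_pairwise_loss(torch.randn(8), torch.randn(8),
                                      torch.rand(8) * 10, torch.rand(8) * 10)
    print(fused.shape, loss.item())
```

Other designs are equally consistent with the abstract, for example learning an explicit per-modality temporal offset instead of masked attention, or discarding border-adjacent pairs rather than down-weighting them.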
Related papers
- DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework [26.661935208583756]
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment.
We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios.
In the first stage, we sample representative frames from the input video and use a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images.
In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are
arXiv Detail & Related papers (2025-08-04T18:27:55Z)
- MMBee: Live Streaming Gift-Sending Recommendations via Multi-Modal Fusion and Behaviour Expansion [18.499672566131355]
Accurately modeling the gifting interaction not only enhances users' experience but also increases streamers' revenue.
Previous studies on live streaming gifting prediction treat this task as a conventional recommendation problem.
We propose MMBee based on real-time Multi-Modal Fusion and Behaviour Expansion to address these issues.
arXiv Detail & Related papers (2024-06-15T04:59:00Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- ContentCTR: Frame-level Live Streaming Click-Through Rate Prediction with Multimodal Transformer [31.10377461705053]
We propose a ContentCTR model that leverages multimodal transformer for frame-level CTR prediction.
We conduct extensive experiments on both real-world scenarios and public datasets, and our ContentCTR model outperforms traditional recommendation models in capturing real-time content changes.
arXiv Detail & Related papers (2023-06-26T03:04:53Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- MultiPath++: Efficient Information Fusion and Trajectory Aggregation for Behavior Prediction [42.563865078323204]
We present MultiPath++, a future prediction model that achieves state-of-the-art performance on popular benchmarks.
We show that our proposed model achieves state-of-the-art performance on the Argoverse Motion Forecasting Competition and Open Motion Prediction Challenge.
arXiv Detail & Related papers (2021-11-29T21:36:53Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions; a short illustrative sketch appears after this list.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
- Graph2Kernel Grid-LSTM: A Multi-Cued Model for Pedestrian Trajectory Prediction by Learning Adaptive Neighborhoods [10.57164270098353]
We present a new perspective on interaction modeling by proposing that pedestrian neighborhoods can be adaptive in design.
Our model outperforms state-of-the-art approaches that collate resembling features on several publicly tested surveillance videos.
arXiv Detail & Related papers (2020-07-03T19:05:48Z)
- Multimodal Matching Transformer for Live Commenting [97.06576354830736]
Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
arXiv Detail & Related papers (2020-02-07T07:19:15Z)
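The "Perceptual Score" entry above describes a modality-reliance metric. To the best of my understanding it is computed from the performance drop when one modality's inputs are permuted across the evaluation set; the sketch below illustrates that idea with hypothetical names (`model`, `modality_reliance`) and should not be taken as the paper's exact normalization.

```python
# Hedged sketch of a perceptual-score-style probe: how much does accuracy drop
# when one modality (here, the image) is shuffled across the batch? Names and
# the exact normalization are assumptions, not the cited paper's definition.
import torch


@torch.no_grad()
def modality_reliance(model, images, questions, labels):
    base = (model(images, questions).argmax(-1) == labels).float().mean()
    perm = torch.randperm(images.size(0))
    shuffled = (model(images[perm], questions).argmax(-1) == labels).float().mean()
    # Near 0: the model barely uses the image modality; near 1: it depends on it.
    return (base - shuffled) / base.clamp(min=1e-8)
```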
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.