Movie Recommendation with Poster Attention via Multi-modal Transformer Feature Fusion
- URL: http://arxiv.org/abs/2407.09157v1
- Date: Fri, 12 Jul 2024 10:44:51 GMT
- Title: Movie Recommendation with Poster Attention via Multi-modal Transformer Feature Fusion
- Authors: Linhan Xia, Yicheng Yang, Ziou Chen, Zheng Yang, Shengxin Zhu
- Abstract summary: This study proposes a multi-modal movie recommendation system by extracting features from the well-designed poster of each movie.
The efficiency of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets.
- Score: 4.228539709089597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained models learn general representations from large datasets which can be fine-tuned for specific tasks to significantly reduce training time. Pre-trained models like generative pre-trained transformers (GPT), bidirectional encoder representations from transformers (BERT), and vision transformers (ViT) have become a cornerstone of current research in machine learning. This study proposes a multi-modal movie recommendation system by extracting features from the well-designed poster of each movie and from its narrative text description. This system uses the BERT model to extract information from the text modality, the ViT model to extract information from the poster/image modality, and the Transformer architecture for feature fusion of all modalities to predict users' preferences. Integrating pre-trained foundational models with smaller datasets in downstream applications captures multi-modal content features in a more comprehensive manner, thereby providing more accurate recommendations. The efficiency of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets. The prediction accuracy of user ratings is enhanced in comparison to the baseline algorithm, demonstrating the potential of this cross-modal algorithm for movie or video recommendation.
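The abstract describes the pipeline only at a high level: BERT embeddings for the text description, ViT embeddings for the poster, and a Transformer that fuses both to predict a user's rating. A minimal sketch of that fusion stage is below; the class name, dimensions, user-embedding table, and pooling strategy are all illustrative assumptions (the paper does not specify them), and random tensors stand in for the pre-trained BERT/ViT outputs.

```python
import torch
import torch.nn as nn

class MultiModalFusionRecommender(nn.Module):
    """Sketch of the fusion idea: one token per modality (user, text,
    poster) is passed through a Transformer encoder, then pooled and
    mapped to a scalar rating. Dimensions are illustrative assumptions."""

    def __init__(self, text_dim=768, image_dim=768, fusion_dim=256, n_users=1000):
        super().__init__()
        # Project each modality into a shared fusion space.
        self.text_proj = nn.Linear(text_dim, fusion_dim)
        self.image_proj = nn.Linear(image_dim, fusion_dim)
        # A learned embedding per user stands in for user preference.
        self.user_emb = nn.Embedding(n_users, fusion_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=fusion_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.rating_head = nn.Linear(fusion_dim, 1)

    def forward(self, text_feat, image_feat, user_ids):
        # Stack one token per modality: [user, text, poster].
        tokens = torch.stack([
            self.user_emb(user_ids),
            self.text_proj(text_feat),
            self.image_proj(image_feat),
        ], dim=1)                                 # (batch, 3, fusion_dim)
        fused = self.fusion(tokens)
        # Mean-pool over the three tokens and predict a scalar rating.
        return self.rating_head(fused.mean(dim=1)).squeeze(-1)

model = MultiModalFusionRecommender()
text = torch.randn(4, 768)    # stand-in for BERT [CLS] embeddings
poster = torch.randn(4, 768)  # stand-in for ViT [CLS] embeddings
users = torch.tensor([0, 1, 2, 3])
ratings = model(text, poster, users)  # one predicted rating per example
```

In practice the stand-in tensors would be replaced by the pooled outputs of pre-trained BERT and ViT encoders, and the rating head trained against observed MovieLens ratings.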
Related papers
- LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection [8.24662649122549]
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query.
Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information.
We propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks.
arXiv Detail & Related papers (2025-01-18T14:54:56Z) - Movie Trailer Genre Classification Using Multimodal Pretrained Features [1.1743167854433303]
We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models.
Our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling.
Our method outperforms state-of-the-art movie genre classification models in terms of precision, recall, and mean average precision (mAP)
arXiv Detail & Related papers (2024-10-11T15:38:05Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNN and ViT learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z) - All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z) - Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which process multiple views of the input video at different resolutions.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z) - BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z) - Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.