Poet: Product-oriented Video Captioner for E-commerce
- URL: http://arxiv.org/abs/2008.06880v1
- Date: Sun, 16 Aug 2020 10:53:46 GMT
- Title: Poet: Product-oriented Video Captioner for E-commerce
- Authors: Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu,
Jingren Zhou, Hongxia Yang, Fei Wu
- Abstract summary: In e-commerce, a growing number of user-generated videos are used for product promotion. Generating video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion.
We propose a product-oriented video captioner framework, abbreviated as Poet.
We show that Poet achieves consistent performance improvement over previous methods in terms of generation quality, product-aspect capturing, and lexical diversity.
- Score: 124.9936946822493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In e-commerce, a growing number of user-generated videos are used for product promotion. How to generate video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion. Traditional video captioning methods, which focus on routinely describing what exists and happens in a video, are not well suited to product-oriented video captioning. To address this problem, we propose a product-oriented video captioner framework, abbreviated as Poet. Poet first represents the videos as product-oriented spatial-temporal graphs. Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs to capture the dynamic change of fine-grained product-part characteristics. The knowledge-leveraging module in Poet differs from the traditional design by performing knowledge filtering and dynamic memory modeling. We show that Poet achieves consistent performance improvement over previous methods in terms of generation quality, product-aspect capturing, and lexical diversity. Experiments are performed on two product-oriented video captioning datasets, the buyer-generated fashion video dataset (BFVD) and the fan-generated fashion video dataset (FFVD), both collected from Mobile Taobao. We will release the desensitized datasets to promote further investigation of both video captioning and general video analysis problems.
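The abstract sketches two mechanisms in the knowledge-leveraging module: filtering retrieved product-aspect knowledge against the video representation, and maintaining a dynamic memory of that knowledge over time. Below is a minimal, hypothetical PyTorch sketch of how such a filter-then-memorize step could be wired together. The class names, tensor shapes, and the GRU-cell update are assumptions made for illustration only; they are not taken from the Poet paper or its released code.

```python
# Hypothetical sketch: knowledge filtering + dynamic memory update.
# Not the authors' implementation; shapes and modules are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeFilter(nn.Module):
    """Scores each knowledge entry against the video context and returns a
    relevance-weighted mixture, instead of attending to all entries equally."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, video_ctx: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # video_ctx: (batch, dim); knowledge: (batch, n_entries, dim)
        q = self.query(video_ctx).unsqueeze(1)             # (batch, 1, dim)
        k = self.key(knowledge)                             # (batch, n, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5        # (batch, n)
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)   # (batch, n, 1)
        return (weights * knowledge).sum(1)                  # (batch, dim)


class DynamicMemory(nn.Module):
    """A GRU-cell memory refreshed with filtered knowledge at every decoding
    step, so the retained aspect information can change over time."""

    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, memory: torch.Tensor, filtered_knowledge: torch.Tensor) -> torch.Tensor:
        return self.cell(filtered_knowledge, memory)         # (batch, dim)


if __name__ == "__main__":
    batch, n_entries, dim = 2, 5, 64
    video_ctx = torch.randn(batch, dim)              # e.g. pooled spatial-temporal graph feature
    knowledge = torch.randn(batch, n_entries, dim)   # e.g. retrieved product-aspect embeddings
    memory = torch.zeros(batch, dim)

    filt = KnowledgeFilter(dim)
    mem = DynamicMemory(dim)
    for _ in range(3):                               # a few decoding steps
        selected = filt(video_ctx, knowledge)
        memory = mem(memory, selected)
    print(memory.shape)                              # torch.Size([2, 64])
```

In this sketch the filtered knowledge is folded into the memory at every decoding step, which is one plausible reading of "dynamic memory modeling"; the actual Poet module may differ in both structure and update rule.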
Related papers
- SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities [0.7510165488300369]
This paper proposes a novel Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities (SPECTRUM) framework.
SPECTRUM discerns multimodal semantics and emotional themes using Visual Text Attribute Investigation (VTAI) and determines the orientation of descriptive captions.
The framework exploits video-to-text retrieval capabilities and the multifaceted nature of video content to estimate the emotional probabilities of candidate captions.
arXiv Detail & Related papers (2024-11-04T10:51:47Z)
- SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model [66.34929233269409]
Talking Head Generation (THG) is an important task with broad application prospects in various fields such as digital humans, film production, and virtual reality.
We propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG.
Our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-05T06:27:32Z)
- One-Shot Pose-Driving Face Animation Platform [7.422568903818486]
We refine an existing Image2Video model by integrating a Face Locator and Motion Frame mechanism.
We optimize the model using extensive human face video datasets, significantly enhancing its ability to produce high-quality talking head videos.
We develop a demo platform using the Gradio framework, which streamlines the process, enabling users to quickly create customized talking head videos.
arXiv Detail & Related papers (2024-07-12T03:09:07Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion [6.096411752534632]
Video captioning models aim to translate video content into accurate natural language.
Existing methods often fail to generate sufficient feature representations of video content.
We propose a video captioning model based on dual graphs and gated fusion.
arXiv Detail & Related papers (2023-08-13T05:18:08Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts [116.05656635044357]
We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
arXiv Detail & Related papers (2023-05-15T17:59:03Z)
- Multimodal Pretraining for Dense Video Captioning [26.39052753539932]
We construct and release a new dense video captioning dataset, Video Timeline Tags (ViTT).
We explore several multimodal sequence-to-sequence pretraining strategies that leverage large unsupervised datasets of videos and caption-like texts.
We show that such models generalize well and are robust over a wide variety of instructional videos.
arXiv Detail & Related papers (2020-11-10T21:49:14Z)
- Comprehensive Information Integration Modeling Framework for Video Titling [124.11296128308396]
We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
The proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
arXiv Detail & Related papers (2020-06-24T10:38:15Z)