Comprehensive Information Integration Modeling Framework for Video
Titling
- URL: http://arxiv.org/abs/2006.13608v1
- Date: Wed, 24 Jun 2020 10:38:15 GMT
- Title: Comprehensive Information Integration Modeling Framework for Video
Titling
- Authors: Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Tan Jiang,
Jingren Zhou, Hongxia Yang, Fei Wu
- Abstract summary: We integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework.
To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization.
We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform.
- Score: 124.11296128308396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In e-commerce, consumer-generated videos, which in general deliver consumers'
individual preferences for the different aspects of certain products, are
massive in volume. To recommend these videos to potential consumers more
effectively, diverse and catchy video titles are critical. However,
consumer-generated videos seldom accompany appropriate titles. To bridge this
gap, we integrate comprehensive sources of information, including the content
of consumer-generated videos, the narrative comment sentences supplied by
consumers, and the product attributes, in an end-to-end modeling framework.
Although automatic video titling is very useful and demanding, it is much less
addressed than video captioning. The latter focuses on generating sentences
that describe videos as a whole while our task requires the product-aware
multi-grained video analysis. To tackle this issue, the proposed method
consists of two processes, i.e., granular-level interaction modeling and
abstraction-level story-line summarization. Specifically, the granular-level
interaction modeling first utilizes temporal-spatial landmark cues, descriptive
words, and abstractive attributes to build three individual graphs and
recognizes the intra-actions in each graph through Graph Neural Networks (GNN).
Then the global-local aggregation module is proposed to model inter-actions
across graphs and aggregate heterogeneous graphs into a holistic graph
representation. The abstraction-level story-line summarization further
considers both frame-level video features and the holistic graph to utilize the
interactions between products and backgrounds, and generate the story-line
topic of the video. We collect a large-scale dataset accordingly from
real-world data in Taobao, a world-leading e-commerce platform, and will make
the desensitized version publicly available to nourish further development of
the research community...
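Below is a minimal, illustrative PyTorch sketch of the two-stage pipeline described in the abstract: three modality graphs (temporal-spatial landmarks, descriptive comment words, product attributes) passed through per-graph GNN layers for intra-actions, a global-local attention aggregation for inter-actions across graphs, and a GRU decoder that fuses frame-level features with the holistic graph vector to generate a title. This is not the authors' implementation; all module names, dimensions, and the simple GCN/attention-pooling choices are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: mix neighbor features through a
    row-normalized adjacency matrix, then apply a linear projection."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes, adj):                               # nodes: (N, d), adj: (N, N)
        adj = adj + torch.eye(adj.size(0), device=adj.device)    # add self-loops
        adj = adj / adj.sum(dim=-1, keepdim=True)                # row-normalize
        return F.relu(self.proj(adj @ nodes))


class GlobalLocalAggregation(nn.Module):
    """Attention-pool each graph into a local summary vector, then attend over
    the per-graph summaries to obtain one holistic graph representation."""
    def __init__(self, dim):
        super().__init__()
        self.local_score = nn.Linear(dim, 1)
        self.global_score = nn.Linear(dim, 1)

    @staticmethod
    def pool(nodes, score_fn):                           # nodes: (N, d)
        weights = torch.softmax(score_fn(nodes), dim=0)  # (N, 1)
        return (weights * nodes).sum(dim=0)              # (d,)

    def forward(self, graphs):                           # list of (N_i, d) tensors
        local = torch.stack([self.pool(g, self.local_score) for g in graphs])
        return self.pool(local, self.global_score)       # (d,)


class VideoTitler(nn.Module):
    """Fuses frame-level video features with the holistic graph vector and
    decodes a title with a GRU language model (teacher forcing)."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.landmark_gnn = SimpleGCNLayer(dim)  # temporal-spatial landmark graph
        self.word_gnn = SimpleGCNLayer(dim)      # descriptive (comment) word graph
        self.attr_gnn = SimpleGCNLayer(dim)      # product-attribute graph
        self.aggregate = GlobalLocalAggregation(dim)
        self.frame_score = nn.Linear(dim, 1)
        self.embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, landmarks, words, attrs, adjs, frames, title_in):
        # Granular-level interaction modeling: intra-actions within each graph.
        graphs = [self.landmark_gnn(landmarks, adjs[0]),
                  self.word_gnn(words, adjs[1]),
                  self.attr_gnn(attrs, adjs[2])]
        # Inter-actions across graphs -> holistic graph representation.
        holistic = self.aggregate(graphs)                          # (d,)
        # Abstraction-level summarization: pool frame features and fuse.
        frame_w = torch.softmax(self.frame_score(frames), dim=0)   # frames: (T, d)
        video = (frame_w * frames).sum(dim=0)                      # (d,)
        h0 = (video + holistic).view(1, 1, -1)                     # (layers, batch, d)
        dec_out, _ = self.decoder(self.embed(title_in), h0)        # title_in: (1, L)
        return self.out(dec_out)                                   # (1, L, vocab)
```

A forward pass would take node-feature and adjacency matrices for the three graphs, frame features of shape (T, d), and teacher-forcing title tokens of shape (1, L), and return per-token vocabulary logits; the paper's actual interaction modeling and story-line topic generation are richer, so this sketch only fixes the overall data flow.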
Related papers
- What's in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z)
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z)
- Neural Graph Matching for Video Retrieval in Large-Scale Video-driven E-commerce [5.534002182451785]
Video-driven e-commerce has shown huge potential in stimulating consumer confidence and promoting sales.
We propose a novel bi-level Graph Matching Network (GMN), which mainly consists of node- and preference-level graph matching.
Comprehensive experiments show the superiority of the proposed GMN with significant improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-08-01T07:31:23Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion [6.096411752534632]
Video captioning models aim to translate video content into accurate natural language.
Existing methods often fail in generating sufficient feature representations of video content.
We propose a video captioning model based on dual graphs and gated fusion.
arXiv Detail & Related papers (2023-08-13T05:18:08Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens in a sequence, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for the different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.