Response to LiveBot: Generating Live Video Comments Based on Visual and
Textual Contexts
- URL: http://arxiv.org/abs/2006.03022v1
- Date: Thu, 4 Jun 2020 17:16:22 GMT
- Title: Response to LiveBot: Generating Live Video Comments Based on Visual and
Textual Contexts
- Authors: Hao Wu, Gareth J. F. Jones, Francois Pitie
- Abstract summary: LiveBot was recently introduced as a novel Automatic Live Video Commenting (ALVC) application.
LiveBot generates live video comments from both the existing video stream and existing viewers' comments.
In this paper, we study these discrepancies in detail and propose an alternative baseline implementation.
- Score: 7.8885775363362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Live video commenting systems are an emerging feature of online video sites.
Recently, the Chinese video sharing platform Bilibili has popularised a novel
captioning system where user comments are displayed as streams of moving
subtitles overlaid on the video playback screen and broadcast to all viewers in
real time. LiveBot was recently introduced as a novel Automatic Live Video
Commenting (ALVC) application. This enables the automatic generation of live
video comments from both the existing video stream and existing viewers'
comments. In seeking to reproduce the baseline results reported in the original
LiveBot paper, we found differences between the reproduced results using the
project codebase and the numbers reported in the paper. Further examination of
this situation suggests that this may be caused by a number of small issues in
the project code, including a non-obvious overlap between the training and test
sets. In this paper, we study these discrepancies in detail and propose an
alternative baseline implementation as a reference for other researchers in
this field.
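Such a train/test overlap can often be detected with a short script before any model is trained. The sketch below is a minimal, hypothetical illustration: the file names and the "video_id" field are assumptions for illustration only, not the actual layout of the LiveBot data files.

```python
import json

def load_video_ids(path):
    """Collect the set of video identifiers referenced by one data split.

    Assumes each line of the file is a JSON object with a "video_id" field;
    the real LiveBot data layout may differ.
    """
    ids = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            ids.add(record["video_id"])
    return ids

if __name__ == "__main__":
    # Hypothetical file names; substitute the actual split files.
    train_ids = load_video_ids("train.jsonl")
    test_ids = load_video_ids("test.jsonl")

    overlap = train_ids & test_ids
    print(f"{len(overlap)} of {len(test_ids)} test videos also appear in training")
```

A non-empty overlap would mean that test-set metrics partly measure memorisation of training comments rather than generation quality, which could contribute to the discrepancies discussed above.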
Related papers
- Enhancing Multimodal Affective Analysis with Learned Live Comment Features [12.437191675553423]
Live comments, also known as Danmaku, are user-generated messages that are synchronized with video content.
We first construct the Live Comment for Affective Analysis dataset which contains live comments for English and Chinese videos.
We then use contrastive learning to train a video encoder to produce synthetic live comment features for enhanced multimodal affective content analysis.
arXiv Detail & Related papers (2024-10-21T18:19:09Z)
- HOTVCOM: Generating Buzzworthy Comments for Videos [49.39846630199698]
This study introduces HotVCom, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments.
We also present the ComHeat framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset.
arXiv Detail & Related papers (2024-09-23T16:45:13Z)
- Live Video Captioning [0.6291443816903801]
We introduce a paradigm shift towards Live Video Captioning (LVC).
In LVC, dense video captioning models must generate captions for video streams in an online manner.
We propose new evaluation metrics tailored for the online scenario, demonstrating their superiority over traditional metrics.
arXiv Detail & Related papers (2024-06-20T11:25:16Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- LiveChat: Video Comment Generation from Audio-Visual Multimodal Contexts [8.070778830276275]
We create a large-scale audio-visual multimodal dialogue dataset to facilitate the development of live commenting technologies.
The data is collected from Twitch, with 11 different categories and 575 streamers for a total of 438 hours of video and 3.2 million comments.
We propose a novel multimodal generation model capable of generating live comments that align with the temporal and spatial events within the video.
arXiv Detail & Related papers (2023-10-01T02:35:58Z)
- Knowledge Enhanced Model for Live Video Comment Generation [40.762720398152766]
We propose a knowledge enhanced generation model inspired by the divergent and informative nature of live video comments.
Our model adopts a pre-training encoder-decoder framework and incorporates external knowledge.
The MovieLC dataset and our code will be released.
arXiv Detail & Related papers (2023-04-28T07:03:50Z)
- Connecting Vision and Language with Video Localized Narratives [54.094554472715245]
We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language.
In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment.
Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects.
arXiv Detail & Related papers (2023-02-22T09:04:00Z)
- VTC: Improving Video-Text Retrieval with User Comments [22.193221760244707]
This paper introduces a new dataset of videos, titles and comments.
By using comments, our method is able to learn better, more contextualised representations for images, videos, and audio.
arXiv Detail & Related papers (2022-10-19T18:11:39Z)
- VPN: Video Provenance Network for Robust Content Attribution [72.12494245048504]
We present VPN - a content attribution method for recovering provenance information from videos shared online.
We learn a robust search embedding for matching such videos, using full-length or truncated video queries.
Once matched against a trusted database of video clips, associated information on the provenance of the clip is presented to the user.
arXiv Detail & Related papers (2021-09-21T09:07:05Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Multimodal Matching Transformer for Live Commenting [97.06576354830736]
Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
arXiv Detail & Related papers (2020-02-07T07:19:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.