LiveSeg: Unsupervised Multimodal Temporal Segmentation of Long
Livestream Videos
- URL: http://arxiv.org/abs/2210.05840v1
- Date: Wed, 12 Oct 2022 00:08:17 GMT
- Title: LiveSeg: Unsupervised Multimodal Temporal Segmentation of Long
Livestream Videos
- Authors: Jielin Qiu, Franck Dernoncourt, Trung Bui, Zhaowen Wang, Ding Zhao,
Hailin Jin
- Abstract summary: Livestream tutorial videos are usually hours long, recorded, and uploaded to the Internet directly after the live sessions, making it hard for other people to catch up quickly.
An outline would be a beneficial solution, which requires the video to be temporally segmented according to topics.
We propose LiveSeg, an unsupervised Livestream video temporal segmentation solution, which takes advantage of multimodal features from different domains.
- Score: 82.48910259277984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Livestream videos have become a significant part of online learning, where
design, digital marketing, creative painting, and other skills are taught by
experienced experts in the sessions, making them valuable materials. However,
Livestream tutorial videos are usually hours long, recorded, and uploaded to
the Internet directly after the live sessions, making it hard for other people
to catch up quickly. An outline would be a beneficial solution, which requires
the video to be temporally segmented according to topics. In this work, we
introduce a large Livestream video dataset named MultiLive, and formulate the
temporal segmentation of the long Livestream videos (TSLLV) task. We propose
LiveSeg, an unsupervised Livestream video temporal Segmentation solution, which
takes advantage of multimodal features from different domains. Our method
achieved a $16.8\%$ F1-score performance improvement compared with the
state-of-the-art method.
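The abstract does not spell out the segmentation mechanics, but the high-level recipe it implies - extract features from several modalities, then cut the timeline where the topic shifts - can be sketched. Below is a minimal, hypothetical Python illustration, assuming per-window visual and transcript embeddings fused by concatenation and a cosine-similarity-drop boundary detector; the window size, fusion scheme, and threshold are assumptions for illustration, not details taken from LiveSeg.

```python
import numpy as np

def fuse_features(visual: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Concatenate per-window visual and transcript embeddings.

    visual: (T, Dv) array, one embedding per fixed-length time window
    text:   (T, Dt) array, aligned transcript embeddings
    """
    # L2-normalize each modality so neither dominates the fused vector.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    t = text / (np.linalg.norm(text, axis=1, keepdims=True) + 1e-8)
    return np.concatenate([v, t], axis=1)

def topic_boundaries(feats: np.ndarray, threshold: float = 0.5) -> list:
    """Flag a boundary wherever cosine similarity between consecutive
    windows drops below `threshold` (an assumed heuristic, not
    LiveSeg's actual criterion)."""
    a, b = feats[:-1], feats[1:]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# Toy usage: 100 one-minute windows with random stand-in embeddings.
rng = np.random.default_rng(0)
fused = fuse_features(rng.normal(size=(100, 512)), rng.normal(size=(100, 384)))
print(topic_boundaries(fused))  # window indices where a new segment starts
```

In practice the random stand-in arrays would be replaced by real frame and ASR-transcript embeddings, and the fixed threshold by a learned or change-point-based criterion.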
Related papers
- Multimodal Language Models for Domain-Specific Procedural Video Summarization [0.0]
We study the use of multimodal models to enhance video summarization and step-by-step instruction generation within specific domains.
Our approach focuses on fine-tuning TimeChat to improve its performance in specific domains: cooking and medical procedures.
Our findings indicate that when finetuned on domain-specific procedural data, TimeChat can significantly improve the extraction and summarization of key instructional steps in long-format videos.
arXiv Detail & Related papers (2024-07-07T15:50:46Z)
- VideoLLM-online: Online Video Large Language Model for Streaming Video [27.073238234038826]
We propose a novel Learning-In-Video-Stream framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream.
Our framework supports streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU.
It also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting.
arXiv Detail & Related papers (2024-06-17T17:55:32Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- LiveChat: Video Comment Generation from Audio-Visual Multimodal Contexts [8.070778830276275]
We create a large-scale audio-visual multimodal dialogue dataset to facilitate the development of live commenting technologies.
The data is collected from Twitch, with 11 different categories and 575 streamers for a total of 438 hours of video and 3.2 million comments.
We propose a novel multimodal generation model capable of generating live comments that align with the temporal and spatial events within the video.
arXiv Detail & Related papers (2023-10-01T02:35:58Z)
- VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation [87.13210748484217]
VideoCutLER is a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos.
We show the first competitive unsupervised learning results on the challenging YouTubeVIS 2019 benchmark, achieving $50.7\%$ AP$^{\text{video}}_{50}$.
VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by $15.9\%$ on YouTubeVIS 2019 in terms of AP$^{\text{video}}$.
arXiv Detail & Related papers (2023-08-28T17:10:12Z)
- Tutorial Recommendation for Livestream Videos using Discourse-Level Consistency and Ontology-Based Filtering [75.78484403289228]
We present a novel dataset and model for the task of tutorial recommendation for live-streamed videos.
The system analyzes the content of the livestream video and recommends the most relevant tutorials.
arXiv Detail & Related papers (2022-09-11T22:45:57Z)
- Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning [50.544635516455116]
This paper focuses on designing video augmentation for self-supervised learning.
We first analyze the best strategy to mix videos to create a new augmented video sample.
We propose Cross-Modal Manifold Cutmix (CMMC) that inserts a video tesseract into another video tesseract in the feature space across two different modalities.
arXiv Detail & Related papers (2021-12-07T18:58:33Z)
- Modeling Live Video Streaming: Real-Time Classification, QoE Inference, and Field Evaluation [1.4353812560047186]
ReCLive is a machine learning method for live video detection and QoE measurement based on network-level behavioral characteristics.
We analyze about 23,000 video streams from Twitch and YouTube, and identify key features in their traffic profile that differentiate live and on-demand streaming.
Our solution provides ISPs with fine-grained visibility into live video streams, enabling them to measure and improve user experience.
arXiv Detail & Related papers (2021-12-05T17:53:06Z)
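The last entry's core idea - distinguishing live from on-demand streams by their network-level traffic profile - also lends itself to a compact sketch. The snippet below is a hedged illustration, assuming generic flow statistics (chunk sizes, inter-chunk gaps, buffering ratio) and a random-forest classifier; the feature names, synthetic distributions, and model choice are assumptions for illustration, not ReCLive's actual design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def synth_flows(n: int, live: bool) -> np.ndarray:
    """Generate toy per-flow traffic statistics. Columns: mean chunk size
    (KB), std chunk size, mean inter-chunk gap (s), std inter-chunk gap,
    buffering ratio. Live streams are modeled with smaller, more regular
    chunks than on-demand ones (an assumption, not measured behavior)."""
    base = (np.array([300.0, 40.0, 2.0, 0.3, 0.02]) if live
            else np.array([900.0, 250.0, 5.0, 2.0, 0.01]))
    return base + rng.normal(scale=base * 0.2, size=(n, 5))

# Build a balanced toy dataset: 1 = live, 0 = on-demand.
X = np.vstack([synth_flows(500, live=True), synth_flows(500, live=False)])
y = np.array([1] * 500 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"toy live/on-demand accuracy: {accuracy_score(y_te, clf.predict(X_te)):.2f}")
```

With the synthetic rows swapped for statistics measured on real Twitch or YouTube flows, this is the kind of per-stream live-versus-on-demand classification the summary describes.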