LIFI: Towards Linguistically Informed Frame Interpolation
- URL: http://arxiv.org/abs/2010.16078v5
- Date: Wed, 2 Dec 2020 16:47:06 GMT
- Title: LIFI: Towards Linguistically Informed Frame Interpolation
- Authors: Aradhya Neeraj Mathur, Devansh Batra, Yaman Kumar, Rajiv Ratn Shah,
Roger Zimmermann
- Abstract summary: We try to solve this problem by using several deep learning video generation algorithms to generate missing frames.
We release several datasets to test computer vision video generation models on their speech understanding.
- Score: 66.05105400951567
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this work, we explore a new problem of frame interpolation for speech
videos, a medium that today forms a major share of online communication. We try
to solve this problem by using several deep learning video generation
algorithms to generate the missing frames. We also provide examples where
computer vision models, despite showing high performance on conventional
non-linguistic metrics, fail to produce faithful interpolations of speech. With
this motivation, we provide a new set of linguistically informed metrics
specifically targeted at the problem of speech video interpolation. We also
release several datasets to test computer vision video generation models on
their speech understanding.
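To make the motivation concrete, here is a minimal sketch (our illustration, not the paper's code) contrasting a conventional pixel-level metric, PSNR, with a simple linguistically informed stand-in: the word error rate (WER) between the ground-truth transcript and a transcript recovered from the interpolated video. How that hypothesis transcript is obtained (e.g., via a lip-reading model) is assumed and left outside the sketch; the function names are illustrative.

```python
# Minimal sketch: a conventional pixel metric vs. a transcript-level metric.
# Not the authors' implementation; the lip-reading step that would produce
# the hypothesis transcript is assumed to exist elsewhere.
import numpy as np

def psnr(frame_a: np.ndarray, frame_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames of identical shape."""
    mse = np.mean((frame_a.astype(np.float64) - frame_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: Levenshtein distance over words, normalized by reference length."""
    d = np.zeros((len(reference) + 1, len(hypothesis) + 1), dtype=np.int64)
    d[:, 0] = np.arange(len(reference) + 1)   # deletions only
    d[0, :] = np.arange(len(hypothesis) + 1)  # insertions only
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + sub)  # substitution
    return d[-1, -1] / max(len(reference), 1)

# Example: pixel metrics can look fine while the perceived words change.
truth = "please read the fourth slide".split()
lipread = "please read the forced slide".split()  # hypothetical lip-read output
print(wer(truth, lipread))  # 0.2: one substituted word out of five
```

An interpolated clip can score a high average PSNR even when the mouth region is distorted just enough to change the perceived words; a transcript-level metric such as WER surfaces exactly the failures the abstract describes.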
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z) - Unsupervised Multimodal Video-to-Video Translation via Self-Supervised
Learning [92.17835753226333]
We propose a novel unsupervised video-to-video translation model.
Our model decomposes style and content using a specialized UV-decoder structure.
Our model can produce photo-realistic videos in a multimodal way.
arXiv Detail & Related papers (2020-04-14T13:44:30Z)