Learning Fine-Grained Visual Understanding for Video Question Answering
via Decoupling Spatial-Temporal Modeling
- URL: http://arxiv.org/abs/2210.03941v1
- Date: Sat, 8 Oct 2022 07:03:31 GMT
- Title: Learning Fine-Grained Visual Understanding for Video Question Answering
via Decoupling Spatial-Temporal Modeling
- Authors: Hsin-Ying Lee, Hung-Ting Su, Bing-Chen Tsai, Tsung-Han Wu, Jia-Fong
Yeh, Winston H. Hsu
- Abstract summary: We decouple spatial-temporal modeling and integrate an image- and a video-language encoder to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
- Score: 28.530765643908083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent large-scale video-language pre-training has made great progress in
video question answering, the design of spatial modeling of video-language
models is less fine-grained than that of image-language models; existing
practices of temporal modeling also suffer from weak and noisy alignment
between modalities. To learn fine-grained visual understanding, we decouple
spatial-temporal modeling and propose a hybrid pipeline, Decoupled
Spatial-Temporal Encoders, integrating an image- and a video-language encoder.
The former encodes spatial semantics from larger but sparsely sampled frames
independently of time, while the latter models temporal dynamics at lower
spatial but higher temporal resolution. To help the video-language model learn
temporal relations for video QA, we propose a novel pre-training objective,
Temporal Referring Modeling, which requires the model to identify temporal
positions of events in video sequences. Extensive experiments demonstrate that
our model outperforms previous work pre-trained on orders of magnitude larger
datasets.
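The paper itself ships no code, so the following is a minimal PyTorch sketch of how a decoupled spatial-temporal pipeline of this kind could be wired together: an image-language encoder processes a few high-resolution frames for spatial semantics, a video-language encoder processes many low-resolution frames for temporal dynamics, and their outputs are fused with the question to predict an answer. The placeholder backbones, embedding size, concatenation-based fusion, and answer vocabulary are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed details, not the authors' code) of Decoupled
# Spatial-Temporal Encoders: spatial semantics from a few high-resolution
# frames, temporal dynamics from many low-resolution frames.
import torch
import torch.nn as nn

class DecoupledSpatialTemporalEncoders(nn.Module):
    def __init__(self, dim=512, vocab_size=30522, num_answers=1000):
        super().__init__()
        # Placeholder backbones; in the paper these would be pre-trained
        # image-language and video-language encoders.
        self.image_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(1), nn.LazyLinear(dim))
        self.video_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(1), nn.LazyLinear(dim))
        self.text_encoder = nn.Sequential(nn.Embedding(vocab_size, dim),
                                          nn.Flatten(1), nn.LazyLinear(dim))
        self.fusion = nn.Linear(3 * dim, dim)           # assumed fusion: concat + linear
        self.answer_head = nn.Linear(dim, num_answers)  # assumed answer vocabulary size

    def forward(self, sparse_hr_frames, dense_lr_frames, question_ids):
        # sparse_hr_frames: (B, T_sparse * 3, 224, 224) few high-resolution frames
        # dense_lr_frames:  (B, T_dense * 3, 112, 112)  many low-resolution frames
        # question_ids:     (B, L) question token ids
        spatial = self.image_encoder(sparse_hr_frames)   # time-independent spatial semantics
        temporal = self.video_encoder(dense_lr_frames)   # temporal dynamics
        text = self.text_encoder(question_ids)
        fused = self.fusion(torch.cat([spatial, temporal, text], dim=-1))
        return self.answer_head(fused)                   # answer logits

model = DecoupledSpatialTemporalEncoders()
logits = model(torch.randn(2, 4 * 3, 224, 224),    # 4 sparse high-resolution frames
               torch.randn(2, 16 * 3, 112, 112),   # 16 dense low-resolution frames
               torch.randint(0, 30522, (2, 20)))   # tokenized question
```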
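Temporal Referring Modeling is described only at the level of requiring the model to identify the temporal positions of events; one plausible way to phrase such an objective is a classification head over temporal segments trained with cross-entropy. The segment count, head, and loss below are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative Temporal-Referring-Modeling-style objective (assumed formulation):
# given a fused representation of an event description and the video, predict the
# index of the temporal segment in which the event occurs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalReferringHead(nn.Module):
    def __init__(self, dim=512, num_segments=4):  # num_segments is an assumption
        super().__init__()
        self.classifier = nn.Linear(dim, num_segments)

    def forward(self, fused_event_repr, target_segment):
        # fused_event_repr: (B, dim) video-conditioned encoding of an event description
        # target_segment:   (B,) ground-truth segment index for each event
        logits = self.classifier(fused_event_repr)
        return F.cross_entropy(logits, target_segment)

head = TemporalReferringHead()
loss = head(torch.randn(8, 512), torch.randint(0, 4, (8,)))
loss.backward()
```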
Related papers
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free video interpolation method for generative video models in a plug-and-play manner.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z) - Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z) - Disentangling Spatial and Temporal Learning for Efficient Image-to-Video
Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by clear margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z) - Orthogonal Temporal Interpolation for Zero-Shot Video Recognition [45.53856045374685]
Zero-shot video recognition (ZSVR) is a task that aims to recognize video categories that have not been seen during the model training process.
Recent vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR.
arXiv Detail & Related papers (2023-08-14T02:26:49Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal
Modeling [48.283659682112926]
We propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text.
arXiv Detail & Related papers (2022-10-21T13:03:49Z) - Simple Video Generation using Neural ODEs [9.303957136142293]
We learn latent variable models that predict the future in latent space and project back to pixels.
We show that our approach yields promising results in the task of future frame prediction on the Moving MNIST dataset with 1 and 2 digits.
arXiv Detail & Related papers (2021-09-07T19:03:33Z) - StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.