Characterizing Video Question Answering with Sparsified Inputs
- URL: http://arxiv.org/abs/2311.16311v1
- Date: Mon, 27 Nov 2023 21:00:20 GMT
- Title: Characterizing Video Question Answering with Sparsified Inputs
- Authors: Shiyuan Huang, Robinson Piramuthu, Vicente Ordonez, Shih-Fu Chang,
Gunnar A. Sigurdsson
- Abstract summary: We characterize the VideoQA task under different levels of input sparsity and provide a tool for doing so.
Specifically, we use a Gumbel-based learnable selection module to adaptively select the best inputs for the final task.
From our experiments, we observe only a 5.2%-5.8% loss of performance when using only 10% of the video length.
- Score: 55.7455981156755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Video Question Answering, videos are often processed as a full-length
sequence of frames to ensure minimal loss of information. Recent works have
demonstrated evidence that sparse video inputs are sufficient to maintain high
performance. However, they usually discuss only the case of single-frame selection.
In our work, we extend the setting to multiple inputs and to other modalities. We
characterize the task under different levels of input sparsity and provide a tool
for doing so. Specifically, we use a Gumbel-based learnable selection
module to adaptively select the best inputs for the final task. In this way, we
experiment over public VideoQA benchmarks and provide analysis on how
sparsified inputs affect the performance. From our experiments, we observe only a
5.2%-5.8% loss of performance when using only 10% of the video length, which
corresponds to 2-4 frames selected from each video. Meanwhile, we also observe
complementary behaviour between visual and textual inputs, even
under highly sparsified settings, suggesting the potential of improving data
efficiency for video-and-language tasks.
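The abstract names a Gumbel-based learnable selection module but does not spell it out. As a rough illustration of the idea, the PyTorch sketch below scores every frame and draws a fixed number of differentiable one-hot selections with Gumbel-Softmax. The class name GumbelFrameSelector, the linear scorer, and the draw-then-mask loop are assumptions made here for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GumbelFrameSelector(nn.Module):
    """Hypothetical sketch of a Gumbel-based frame selector:
    score each frame, then draw `num_select` one-hot selections
    with Gumbel-Softmax so the choice stays differentiable."""

    def __init__(self, feat_dim: int, num_select: int, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # per-frame relevance score
        self.num_select = num_select
        self.tau = tau

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        logits = self.scorer(frame_feats).squeeze(-1)  # (batch, num_frames)
        picks = []
        for _ in range(self.num_select):
            # hard=True yields a one-hot selection in the forward pass while
            # gradients flow through the soft Gumbel-Softmax probabilities
            w = F.gumbel_softmax(logits, tau=self.tau, hard=True)
            # gather the selected frame feature for each sample in the batch
            picks.append(torch.einsum("bt,btd->bd", w, frame_feats))
            # mask already-selected frames so later draws pick different ones
            logits = logits.masked_fill(w.bool(), float("-inf"))
        return torch.stack(picks, dim=1)  # (batch, num_select, feat_dim)


if __name__ == "__main__":
    selector = GumbelFrameSelector(feat_dim=512, num_select=4)
    feats = torch.randn(2, 40, 512)  # e.g. 40 frames per video
    selected = selector(feats)
    print(selected.shape)  # torch.Size([2, 4, 512])
```

With hard=True, each draw is discrete in the forward pass while the backward pass uses the soft probabilities, which is what makes such a selection module trainable end to end for the downstream VideoQA objective.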
Related papers
- Causal Video Summarizer for Video Exploration [74.27487067877047]
Causal Video Summarizer (CVS) is proposed to capture the interactive information between the video and query.
Experimental results on an existing multi-modal video summarization dataset show that the proposed approach is effective.
arXiv Detail & Related papers (2023-07-04T22:52:16Z) - Deep Unsupervised Key Frame Extraction for Efficient Video
Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can calculate the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement
Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove and reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z) - BridgeFormer: Bridging Video-text Retrieval with Multiple Choice
Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module named BridgeFormer is trained to answer the "questions" constructed from the text features by resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task across five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Beyond Short Clips: End-to-End Video-Level Learning with Collaborative
Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z) - Self-supervised Video Representation Learning by Context and Motion
Decoupling [45.510042484456854]
A challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias.
We develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task.
Experiments show that our approach improves the quality of the learned video representation over previous works.
arXiv Detail & Related papers (2021-04-02T02:47:34Z) - Scene-Adaptive Video Frame Interpolation via Meta-Learning [54.87696619177496]
We propose to adapt the model to each video by making use of additional information that is readily available at test time.
We obtain significant performance gains with only a single gradient update without any additional parameters.
arXiv Detail & Related papers (2020-04-02T02:46:44Z) - Straight to the Point: Fast-forwarding Videos via Reinforcement Learning
Using Textual Data [1.004766879203303]
We present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos.
Our approach can adaptively select and remove the frames that are not relevant to conveying the information, without creating gaps in the final video.
We propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space.
arXiv Detail & Related papers (2020-03-31T14:07:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.