Mounting Video Metadata on Transformer-based Language Model for
Open-ended Video Question Answering
- URL: http://arxiv.org/abs/2108.05158v1
- Date: Wed, 11 Aug 2021 11:11:43 GMT
- Title: Mounting Video Metadata on Transformer-based Language Model for
Open-ended Video Question Answering
- Authors: Donggeon Lee, Seongho Choi, Youwon Jang, Byoung-Tak Zhang
- Abstract summary: We challenge the existing multiple-choice video question answering by changing it to open-ended video question answering.
To tackle open-ended question answering, we use the pretrained GPT2 model.
An ablation study is performed by converting the existing DramaQA dataset to an open-ended question answering setting, and it shows that performance can be improved using video metadata.
- Score: 18.664991529995664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video question answering has recently received a lot of attention from
multimodal video researchers. Most video question answering datasets are in the
form of multiple-choice. However, a model for the multiple-choice task does not
infer the answer; rather, it compares the answer candidates and picks the most
likely one, which also makes it difficult to extend to other tasks. In this
paper, we challenge the existing multiple-choice video
question answering by changing it to open-ended video question answering. To
tackle open-ended question answering, we use the pretrained GPT2 model. The
model is fine-tuned with video inputs and subtitles. An ablation study is
performed by converting the existing DramaQA dataset to an open-ended question
answering setting, and it shows that performance can be improved using video
metadata.
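The abstract only states that GPT-2 is fine-tuned with video inputs and subtitles, so the following is a minimal sketch, assuming per-clip visual features are linearly projected into the GPT-2 embedding space and prepended to the tokenized subtitle/question/answer text; the projection layer, the 2048-d feature size, and the prompt layout are illustrative assumptions, not the paper's exact design.
```python
# Minimal sketch (not the paper's exact architecture): condition GPT-2 on video
# features by projecting them into the token-embedding space and prepending
# them to the subtitle/question text before fine-tuning with an LM loss.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class VideoConditionedGPT2(nn.Module):
    def __init__(self, video_dim=2048):                      # 2048 is an assumption
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd
        self.video_proj = nn.Linear(video_dim, hidden)        # visual -> text space

    def forward(self, video_feats, input_ids, labels=None):
        # video_feats: (B, n_clips, video_dim); input_ids: (B, seq_len)
        vis = self.video_proj(video_feats)                    # (B, n_clips, H)
        txt = self.gpt2.transformer.wte(input_ids)            # (B, seq_len, H)
        inputs_embeds = torch.cat([vis, txt], dim=1)
        if labels is not None:
            # Ignore the LM loss over the visual prefix positions.
            prefix = torch.full(vis.shape[:2], -100,
                                dtype=labels.dtype, device=labels.device)
            labels = torch.cat([prefix, labels], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = VideoConditionedGPT2()
text = "subtitle: Hello there. question: Who enters the room? answer: Dokyung"
ids = tokenizer(text, return_tensors="pt").input_ids
feats = torch.randn(1, 8, 2048)                               # dummy clip features
loss = model(feats, ids, labels=ids.clone()).loss             # next-token LM loss
loss.backward()
```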
Related papers
- Open-vocabulary Video Question Answering: A New Benchmark for Evaluating
the Generalizability of Video Question Answering Models [15.994664381976984]
We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance.
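The abstract does not describe the GNN-based soft verbalizer itself; the sketch below only illustrates the underlying idea of smoothing answer-vocabulary embeddings over a similarity graph so that rare or unseen answers borrow signal from related frequent ones. The kNN graph and the single mean-aggregation step are assumptions, not OVQA's actual design.
```python
# Rough sketch of a graph-smoothed ("soft") verbalizer over the answer vocabulary.
import torch
import torch.nn.functional as F

def knn_graph(answer_emb, k=5):
    # 0/1 adjacency over the k nearest neighbours by cosine similarity.
    normed = F.normalize(answer_emb, dim=-1)
    sim = normed @ normed.T
    idx = sim.topk(k + 1, dim=-1).indices[:, 1:]    # drop self-similarity
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    return adj

def soft_verbalizer_scores(question_repr, answer_emb, adj):
    # One round of mean message passing, combined residually with the originals.
    deg = adj.sum(-1, keepdim=True).clamp(min=1)
    smoothed = (adj @ answer_emb) / deg
    answer_proto = 0.5 * answer_emb + 0.5 * smoothed          # (num_answers, dim)
    return question_repr @ answer_proto.T                     # (batch, num_answers)

answers = torch.randn(1000, 256)   # stand-in for answer-word embeddings
adj = knn_graph(answers, k=5)
q = torch.randn(4, 256)            # pooled video-question representations
print(soft_verbalizer_scores(q, answers, adj).shape)   # torch.Size([4, 1000])
```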
arXiv Detail & Related papers (2023-08-18T07:45:10Z)
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT can achieve much better performances than previous arts on video reasoning tasks.
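As a rough illustration of answering "in a contrastive manner", the sketch below scores a fused video-question representation against answer-candidate embeddings with an InfoNCE-style loss, where the correct answer is the positive. CoVGT's graph transformer encoders and its full objective are not reproduced; the stand-in tensors and the 0.07 temperature are assumptions.
```python
# InfoNCE-style contrastive objective for candidate-answer VideoQA (illustrative).
import torch
import torch.nn.functional as F

def contrastive_qa_loss(vq_repr, answer_reprs, target, temperature=0.07):
    # vq_repr:      (B, D)    fused video + question embedding
    # answer_reprs: (B, C, D) embeddings of the answer candidates
    # target:       (B,)      index of the correct candidate
    vq = F.normalize(vq_repr, dim=-1).unsqueeze(1)    # (B, 1, D)
    ans = F.normalize(answer_reprs, dim=-1)           # (B, C, D)
    logits = (vq * ans).sum(-1) / temperature         # cosine scores, (B, C)
    return F.cross_entropy(logits, target)            # positive = correct answer

B, C, D = 8, 5, 512
loss = contrastive_qa_loss(torch.randn(B, D), torch.randn(B, C, D),
                           torch.randint(0, C, (B,)))
print(loss.item())
```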
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
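The sketch below illustrates the cascaded-selection idea in a much simplified form: question-guided top-k segment selection followed by attention over regions inside only the selected segments. The dot-product scoring and hard top-k are assumptions and stand in for MIST's actual iterative modules.
```python
# Simplified cascaded segment -> region selection instead of dense attention.
import torch
import torch.nn.functional as F

def cascaded_select(question, segment_feats, region_feats, k=2):
    # question:      (dim,)             question embedding
    # segment_feats: (n_seg, dim)        pooled per-segment features
    # region_feats:  (n_seg, n_reg, dim) per-region features inside each segment
    seg_scores = segment_feats @ question                  # segment relevance
    top = seg_scores.topk(k).indices                       # keep k segments only
    regions = region_feats[top]                            # (k, n_reg, dim)
    attn = F.softmax(regions @ question, dim=-1)           # region relevance
    pooled = (attn.unsqueeze(-1) * regions).sum(dim=1)     # (k, dim)
    return pooled.mean(dim=0)                              # fused visual evidence

out = cascaded_select(torch.randn(256), torch.randn(16, 256),
                      torch.randn(16, 10, 256), k=2)
print(out.shape)   # torch.Size([256])
```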
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Locate before Answering: Answer Guided Question Localization for Video
Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
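The abstract gives only the high-level design, so the sketch below shows one way a question locator and an answer predictor can be coupled end to end: the locator softly weights clips by relevance so localization stays differentiable, and the answer head predicts from the located evidence. All dimensions and the soft-selection choice are illustrative assumptions.
```python
# Illustrative locate-then-answer model with differentiable soft localization.
import torch
import torch.nn as nn

class LocateThenAnswer(nn.Module):
    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.locator = nn.Linear(dim * 2, 1)               # clip-question relevance
        self.answer_head = nn.Linear(dim * 2, num_answers)

    def forward(self, clip_feats, question):
        # clip_feats: (B, n_clips, dim); question: (B, dim)
        q = question.unsqueeze(1).expand_as(clip_feats)
        rel = self.locator(torch.cat([clip_feats, q], dim=-1)).squeeze(-1)
        weights = rel.softmax(dim=-1)                               # soft localization
        located = (weights.unsqueeze(-1) * clip_feats).sum(dim=1)   # (B, dim)
        logits = self.answer_head(torch.cat([located, question], dim=-1))
        return logits, weights

model = LocateThenAnswer()
logits, weights = model(torch.randn(2, 32, 512), torch.randn(2, 512))
print(logits.shape, weights.shape)   # (2, 1000) and (2, 32)
```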
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
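The sketch below only shows the shape of such a pipeline: take a candidate answer span from a transcribed narration sentence and ask a text-only question-generation transformer to produce the matching question. The checkpoint name is a placeholder because the abstract does not name the exact QG model; a real run needs a model fine-tuned for question generation.
```python
# Shape of a narration -> (question, answer) generation step; the checkpoint is a
# placeholder, not a real model id.
from transformers import T5ForConditionalGeneration, T5Tokenizer

QG_CHECKPOINT = "your-org/t5-question-generation"   # placeholder

def generate_qa(narration, answer_span, tokenizer, model):
    # "answer + context" prompt; the QG model is expected to emit the question.
    prompt = f"answer: {answer_span}  context: {narration}"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=32)
    question = tokenizer.decode(out[0], skip_special_tokens=True)
    return question, answer_span

# Usage once a suitable QG checkpoint is available:
# tokenizer = T5Tokenizer.from_pretrained(QG_CHECKPOINT)
# model = T5ForConditionalGeneration.from_pretrained(QG_CHECKPOINT)
# q, a = generate_qa("The chef slices the onions and adds them to the pan.",
#                    "the onions", tokenizer, model)
```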
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
- Fill-in-the-blank as a Challenging Video Understanding Evaluation
Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model have a large gap with human performance.
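As an illustration of the text-only side of such a benchmark, the sketch below fills a blanked caption word with an off-the-shelf masked language model and checks it by exact match. The benchmark's actual answer format and scoring protocol are not given in the abstract, so this is only a stand-in baseline.
```python
# Text-only fill-in-the-blank baseline with a masked language model (illustrative).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def fill_blank(sentence_with_blank, reference):
    text = sentence_with_blank.replace("____", tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    pred = tokenizer.decode([logits[0, mask_pos].argmax().item()]).strip()
    return pred, pred.lower() == reference.lower()

print(fill_blank("A man is playing the ____ on stage.", "guitar"))
```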
arXiv Detail & Related papers (2021-04-09T04:00:10Z)
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.