Data augmentation techniques for the Video Question Answering task
- URL: http://arxiv.org/abs/2008.09849v1
- Date: Sat, 22 Aug 2020 14:34:55 GMT
- Title: Data augmentation techniques for the Video Question Answering task
- Authors: Alex Falcon, Oswald Lanz, Giuseppe Serra
- Abstract summary: We focus on the Egocentric VideoQA task, which exploits first-person videos.
Given the small size of the EgoVQA dataset, models tend to overfit quickly.
We propose several augmentation techniques which give us a +5.5% improvement on the final accuracy over the considered baseline.
- Score: 16.548016892117083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Question Answering (VideoQA) is a task that requires a model to analyze
and understand both the visual content given by the input video and the textual
part given by the question, and the interaction between them in order to
produce a meaningful answer. In our work we focus on the Egocentric VideoQA
task, which exploits first-person videos, because such a task can have an
impact on many different fields, such as social assistance and industrial
training. Recently, an Egocentric
VideoQA dataset, called EgoVQA, has been released. Given its small size, models
tend to overfit quickly. To alleviate this problem, we propose several
augmentation techniques which give us a +5.5% improvement on the final accuracy
over the considered baseline.
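To make the kind of augmentation concrete, below is a minimal Python sketch of transforms one might apply to (video frames, question, answer) triples in a VideoQA training pipeline. The specific transforms (horizontal flipping with left/right token swapping and random temporal subsampling) and all function names are illustrative assumptions; the abstract does not enumerate the techniques actually proposed in the paper.

```python
import random
import numpy as np

# Illustrative augmentations for (frames, question, answer) triples.
# NOTE: these specific transforms are assumptions for illustration only;
# they are not taken from the paper, whose abstract does not list its techniques.

def temporal_subsample(frames: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """Randomly drop frames to vary the temporal sampling of the clip."""
    num_frames = frames.shape[0]
    keep = max(1, int(num_frames * keep_ratio))
    idx = np.sort(np.random.choice(num_frames, size=keep, replace=False))
    return frames[idx]

def horizontal_flip(frames: np.ndarray, question: str, answer: str):
    """Mirror the frames and keep the text consistent by swapping left/right."""
    flipped = frames[:, :, ::-1, :]  # frames assumed to be shaped (T, H, W, C)
    swap = {"left": "right", "right": "left"}
    fix = lambda text: " ".join(swap.get(w, w) for w in text.split())
    return flipped, fix(question), fix(answer)

def augment(sample):
    """Apply a random augmentation to one (frames, question, answer) sample."""
    frames, question, answer = sample
    if random.random() < 0.5:
        frames, question, answer = horizontal_flip(frames, question, answer)
    frames = temporal_subsample(frames)
    return frames, question, answer
```

In practice such transforms would be applied on the fly in the training data loader, so the effective dataset size grows without storing any new videos, which is the usual way to counter overfitting on a small dataset like EgoVQA.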
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
- MM-Ego: Towards Building Egocentric Multimodal LLMs [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding.
We develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long, based on human-annotated data.
We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
- Grounded Question-Answering in Long Egocentric Videos [39.281013854331285]
Open-ended question-answering (QA) in long, egocentric videos allows individuals or robots to inquire about their own past visual experiences.
This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content.
Our proposed approach tackles these challenges by integrating query grounding and answering within a unified model to reduce error propagation.
arXiv Detail & Related papers (2023-12-11T16:31:55Z)
- TG-VQA: Ternary Game of Video Question Answering [33.180788803602084]
Video question answering aims at answering a question about the video content by reasoning over the alignment semantics between the video and the question.
In this work, we innovatively resort to game theory, which can simulate complicated relationships among multiple players with specific interaction strategies.
Specifically, we carefully design a VideoQA-specific interaction strategy to tailor the characteristics of VideoQA, which can mathematically generate the fine-grained visual-linguistic alignment label without label-intensive efforts.
arXiv Detail & Related papers (2023-05-17T08:42:53Z)
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves much better performance than previous methods on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between them and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)