The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA
- URL: http://arxiv.org/abs/2407.01907v1
- Date: Tue, 2 Jul 2024 03:13:27 GMT
- Title: The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA
- Authors: Hailiang Zhang, Dian Chao, Zhihao Guan, Yang Yang
- Abstract summary: Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking.
A significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects.
- Score: 3.38659196496483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a grounded video question-answering solution. Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking. However, a significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects. Furthermore, single images cannot address questions like "Track the container from which the person pours the first time." To tackle this issue, we propose an alternative two-stage approach: (1) first, we leverage the VALOR model to answer questions based on video information; (2) then, we concatenate each question with its respective answer. Finally, we employ TubeDETR to generate bounding boxes for the targets.
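For illustration, a minimal Python sketch of this two-stage flow. The `answer_fn` and `ground_fn` callables are hypothetical stand-ins for VALOR and TubeDETR inference wrappers (the real repositories expose their own APIs); this shows the data flow, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # per-frame (x1, y1, x2, y2)

def grounded_video_qa(
    video_path: str,
    question: str,
    answer_fn: Callable[[str, str], str],        # e.g. a VALOR inference wrapper
    ground_fn: Callable[[str, str], List[Box]],  # e.g. a TubeDETR inference wrapper
) -> Tuple[str, List[Box]]:
    # Stage 1: answer the question from the whole video, not a single frame.
    answer = answer_fn(video_path, question)
    # Stage 2: concatenate the question and its answer into one grounding
    # query, then localize the referred object as per-frame bounding boxes.
    query = f"{question} {answer}"
    boxes = ground_fn(video_path, query)
    return answer, boxes
```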
Related papers
- First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge [4.075139470537149]
We present our first-place solution to the Multiple-choice Video Question Answering track of The Second Perception Test Challenge.
This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content.
arXiv Detail & Related papers (2024-09-20T14:31:13Z)
- Causal Understanding For Video Question Answering [2.749898166276854]
Video Question Answering is a challenging task that requires a model to reason over multiple frames and understand the interactions between different objects in order to answer questions based on the context provided within the video.
Previous approaches leverage either sub-sampled information or causal intervention techniques along with complete video features to tackle the NExT-QA task.
In this work, we expose the limitations of these approaches and propose solutions along four novel directions of improvement on the NExT-QA dataset.
arXiv Detail & Related papers (2024-07-23T06:32:46Z)
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Solution for Point Tracking Task of ICCV 1st Perception Test Challenge 2023 [50.910598799408326]
The Tracking Any Point (TAP) task tracks any physical surface through a video.
Several existing approaches have explored TAP by considering temporal relationships to obtain smooth point motion trajectories.
We propose a simple yet effective approach called TAP with confident static points (TAPIR+), which focuses on rectifying the tracking of static points in videos shot by a static camera.
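A toy sketch of the static-point rectification idea (not the authors' implementation; the variance test and the median snap are illustrative assumptions):

```python
import numpy as np

def rectify_static_points(tracks: np.ndarray, motion_thresh: float = 1.0) -> np.ndarray:
    """tracks: (num_points, num_frames, 2) array of per-frame (x, y) predictions.

    Points whose predicted positions barely vary are treated as confident
    static points and snapped to a single robust (median) location.
    """
    rectified = tracks.copy()
    for i, traj in enumerate(tracks):
        spread = traj.std(axis=0).max()  # small spread => likely a static point
        if spread < motion_thresh:
            rectified[i] = np.median(traj, axis=0)  # constant trajectory
    return rectified
```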
arXiv Detail & Related papers (2024-03-26T13:50:39Z)
- Perception Test 2023: A Summary of the First Challenge And Outcome [67.0525378209708]
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
arXiv Detail & Related papers (2023-12-20T15:12:27Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
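A schematic of the second stage of such a top-down pipeline, assuming stage-1 tracklets already exist and using a hypothetical `score_fn` in place of the Transformer-based tracklet-language grounding module:

```python
from typing import Callable, List, Sequence, TypeVar

Tracklet = TypeVar("Tracklet")  # stage-1 output: one mask sequence per object

def ground_tracklets(
    tracklets: Sequence[Tracklet],
    expression: str,
    score_fn: Callable[[Tracklet, str], float],  # tracklet-language grounding module
) -> Tracklet:
    # Stage 2: score every candidate tracklet against the language reference
    # and return the best match; the chosen tracklet's masks are the answer.
    scores: List[float] = [score_fn(t, expression) for t in tracklets]
    best = max(range(len(tracklets)), key=scores.__getitem__)
    return tracklets[best]
```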
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
- Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules [85.98177341704675]
The problem of grounding VQA tasks has seen increased attention in the research community recently.
We propose a visual capsule module with a query-based selection mechanism of capsule features.
We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task.
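One way to read "query-based selection of capsule features" is as soft attention with the question as the query. A hedged PyTorch sketch of that interpretation (shapes and naming are assumptions, not the paper's module):

```python
import torch
import torch.nn.functional as F

def select_capsules(question_emb: torch.Tensor, capsules: torch.Tensor) -> torch.Tensor:
    """question_emb: (batch, dim) question query; capsules: (batch, n, dim).

    Returns a (batch, dim) summary weighted toward question-relevant capsules.
    """
    logits = torch.einsum("bd,bnd->bn", question_emb, capsules)  # query-capsule similarity
    weights = F.softmax(logits, dim=-1)                          # soft selection weights
    return torch.einsum("bn,bnd->bd", weights, capsules)
```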
arXiv Detail & Related papers (2021-05-11T07:45:32Z)
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are they always practically available, it is crucial to generate question-answer pairs from a video via Video Question-Answer Generation (VQAG).
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)