RTQ: Rethinking Video-language Understanding Based on Image-text Model
- URL: http://arxiv.org/abs/2312.00347v2
- Date: Mon, 18 Dec 2023 04:59:01 GMT
- Title: RTQ: Rethinking Video-language Understanding Based on Image-text Model
- Authors: Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, and Liqiang
Nie
- Abstract summary: Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
- Score: 55.278942477715084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in video-language understanding have been built on the
foundation of image-text models, yielding promising outcomes due to the
knowledge shared between images and videos. However, video-language
understanding presents unique challenges due to the inclusion of highly complex
semantic details, which result in information redundancy, temporal dependency,
and scene complexity. Current techniques have only partially tackled these
issues, and our quantitative analysis indicates that some of these methods are
complementary. In light of this, we propose a novel framework called RTQ
(Refine, Temporal model, and Query), which addresses these challenges
simultaneously. The approach involves refining redundant information within
frames, modeling temporal relations among frames, and querying task-specific
information from the videos. Remarkably, our model demonstrates outstanding
performance even in the absence of video-language pre-training, and the results
are comparable with or superior to those achieved by state-of-the-art
pre-training methods. Code is available at
https://github.com/SCZwangxiao/RTQ-MM2023.
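To make the three stages concrete, the following is a minimal PyTorch sketch of a pipeline in the Refine, Temporal model, and Query spirit. Every concrete choice here (saliency-based token pruning, a Transformer layer over pooled frame features, learnable cross-attention queries, and all dimensions) is an assumption for illustration and is not taken from the RTQ paper or its released code.

```python
import torch
import torch.nn as nn


class RTQStyleSketch(nn.Module):
    """Illustrative Refine -> Temporal model -> Query pipeline (not the official RTQ code)."""

    def __init__(self, dim=512, keep_ratio=0.5, num_queries=32, num_heads=8):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Refine: score patch tokens so redundant ones can be dropped.
        self.saliency = nn.Linear(dim, 1)
        # Temporal model: relate frames to each other.
        self.temporal = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        # Query: learnable tokens that pull task-specific information from the video.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, D) patch features from a frozen image-text encoder (e.g. CLIP).
        B, T, N, D = frame_tokens.shape

        # 1) Refine: keep only the most salient patch tokens within each frame.
        scores = self.saliency(frame_tokens).squeeze(-1)            # (B, T, N)
        k = max(1, int(N * self.keep_ratio))
        idx = scores.topk(k, dim=-1).indices                        # (B, T, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, D)               # (B, T, k, D)
        refined = frame_tokens.gather(2, idx)                       # (B, T, k, D)

        # 2) Temporal model: one pooled token per frame, related across time.
        frame_repr = refined.mean(dim=2)                            # (B, T, D)
        temporal = self.temporal(frame_repr)                        # (B, T, D)

        # 3) Query: cross-attend learnable queries over the temporal features.
        q = self.queries.unsqueeze(0).expand(B, -1, -1)             # (B, Q, D)
        out, _ = self.cross_attn(q, temporal, temporal)             # (B, Q, D)
        return out
```

In a full system the query outputs would be matched against text features for retrieval or fed to a decoder for captioning and QA; the sketch only shows how the three stages compose, which is the core idea of the abstract.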
Related papers
- Admitting Ignorance Helps the Video Question Answering Models to Answer [82.22149677979189]
We argue that models often learn shortcuts, resulting in spurious correlations between questions and answers.
We propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question.
In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness.
arXiv Detail & Related papers (2025-01-15T12:44:52Z)
- Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos.
Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities.
We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. (A hedged sketch of this query idea appears after this list.)
arXiv Detail & Related papers (2024-12-26T17:53:14Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that although a video is presented as a frame sequence, its visual elements are not sequential but hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
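The "Perceive, Query & Reason" entry above mentions question-guided temporal queries that bridge frame-wise perception and an LLM. Below is a minimal sketch of that general idea, assuming a design in which a pooled question embedding conditions a set of learnable query tokens that cross-attend over per-frame features; this is an illustrative guess at the mechanism, not the published T-Former architecture.

```python
import torch
import torch.nn as nn


class QuestionGuidedTemporalQueries(nn.Module):
    """Sketch of question-guided temporal queries (illustrative, not the paper's code)."""

    def __init__(self, dim=768, num_queries=16, num_heads=8):
        super().__init__()
        self.base_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.condition = nn.Linear(dim, dim)  # mixes question information into the queries
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, frame_feats, question_emb):
        # frame_feats: (B, T, D) one feature per frame; question_emb: (B, D) pooled question.
        B = frame_feats.size(0)
        q = self.base_queries.unsqueeze(0).expand(B, -1, -1)    # (B, Q, D)
        q = q + self.condition(question_emb).unsqueeze(1)       # question-guided queries
        bridged, _ = self.attn(q, frame_feats, frame_feats)     # attend across time
        return bridged                                          # (B, Q, D) tokens for the LLM
```

A plausible use of the output, under the same assumptions, is to project the returned tokens into the LLM's embedding space and prepend them to the question, so the LLM reasons over a compact, question-relevant summary of the video rather than all frames.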