V$^2$Dial: Unification of Video and Visual Dialog via Multimodal Experts
- URL: http://arxiv.org/abs/2503.02063v2
- Date: Fri, 14 Mar 2025 12:29:29 GMT
- Title: V$^2$Dial: Unification of Video and Visual Dialog via Multimodal Experts
- Authors: Adnen Abdessaied, Anna Rohrbach, Marcus Rohrbach, Andreas Bulling
- Abstract summary: V$^2$Dial is a novel expert-based model geared towards simultaneously handling image and video input data for multimodal conversational tasks. We propose to unify both tasks using a single model that for the first time jointly learns the spatial and temporal features of images and videos.
- Score: 44.33388344586592
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present V$^2$Dial - a novel expert-based model specifically geared towards simultaneously handling image and video input data for multimodal conversational tasks. Current multimodal models primarily focus on simpler tasks (e.g., VQA, VideoQA, video-text retrieval) and often neglect the more challenging conversational counterparts, such as video and visual/image dialog. Moreover, work on these two conversational tasks has evolved separately despite their apparent similarities, which limits their broader applicability. To this end, we propose to unify both tasks using a single model that, for the first time, jointly learns the spatial and temporal features of images and videos by routing them through dedicated experts and aligning them using matching and contrastive learning techniques. Furthermore, we systematically study the domain shift between the two tasks by investigating whether and to what extent these seemingly related tasks can mutually benefit from their respective training data. Extensive evaluations on the widely used video and visual dialog datasets AVSD and VisDial show that our model achieves new state-of-the-art results across four benchmarks in both zero-shot and fine-tuning settings.
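The expert routing and alignment described in the abstract can be pictured with a short, hedged sketch. The code below is not the authors' implementation; the module names, dimensions, and the symmetric InfoNCE-style loss are illustrative assumptions about how frame-level (spatial) and clip-level (temporal) features might be routed through dedicated experts and aligned with text via contrastive learning.

```python
# Hedged sketch (not the V^2Dial code): dedicated spatial/temporal experts
# plus a contrastive alignment loss. All names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A simple residual feed-forward expert block."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x):
        return x + self.net(x)


class ModalityExpertLayer(nn.Module):
    """Routes patch-level features to a spatial expert and
    frame-level features to a temporal expert."""
    def __init__(self, dim: int):
        super().__init__()
        self.spatial_expert = Expert(dim)
        self.temporal_expert = Expert(dim)

    def forward(self, spatial_feats, temporal_feats=None):
        # spatial_feats:  (B, N_patches, D) per-frame patch tokens
        # temporal_feats: (B, N_frames, D) pooled frame tokens (None for still images)
        s = self.spatial_expert(spatial_feats)
        t = self.temporal_expert(temporal_feats) if temporal_feats is not None else None
        return s, t


def contrastive_alignment(visual_emb, text_emb, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss between pooled visual and text embeddings (B, D)."""
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


if __name__ == "__main__":
    B, P, Fr, D = 4, 16, 8, 256
    layer = ModalityExpertLayer(D)
    s, t = layer(torch.randn(B, P, D), torch.randn(B, Fr, D))
    loss = contrastive_alignment(s.mean(dim=1), torch.randn(B, D))
    print(loss.item())
```

In this sketch, images simply skip the temporal expert, which is one plausible way a single model could handle both image and video inputs; the matching objective mentioned in the abstract is omitted for brevity.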
Related papers
- UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models [23.044366104080822]
We introduce UniRS, the first vision-language model unifying multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks.
arXiv Detail & Related papers (2024-12-30T06:34:18Z) - Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network [57.72095897427665]
Temporal sentence grounding (TSG) aims to locate query-relevant segments in videos.
Previous methods follow a single-thread framework that cannot co-train different pairs.
We propose Multi-Pair TSG, which aims to co-train these pairs.
arXiv Detail & Related papers (2024-12-20T08:50:11Z) - UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization [83.89550658314741]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL).
We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z) - Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
Chat-UniVi is a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos.
We employ a set of dynamic visual tokens to uniformly represent images and videos.
We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
arXiv Detail & Related papers (2023-11-14T10:11:36Z) - MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)