Related papers: Multi-Modal Video Dialog State Tracking in the Wild

Multi-Modal Video Dialog State Tracking in the Wild

URL: http://arxiv.org/abs/2407.02218v2
Date: Fri, 5 Jul 2024 08:55:00 GMT
Title: Multi-Modal Video Dialog State Tracking in the Wild
Authors: Adnen Abdessaied, Lei Shi, Andreas Bulling,
Abstract summary: MST-MIXER is a novel video dialog model operating over a generic multi-modal state tracking scheme. It predicts the missing underlying structure of the selected constituents of each input modality using a novel multi-modal graph structure learning method. It achieves new state-of-the-art results on five challenging benchmarks.
Score: 10.453212911612866
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We present MST-MIXER - a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short of two major aspects: (1) They either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of real-world in the wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.

Related papers

Towards Universal Modal Tracking with Online Dense Temporal Token Learning [66.83607018706519]
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning.<n>We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective.
arXiv Detail & Related papers (2025-07-27T08:47:42Z)
Unified People Tracking with Graph Neural Networks [39.22185669123208]
We present a unified, fully differentiable model for multi-people tracking that learns to associate detections into trajectories.<n>The model builds a dynamic graph that aggregates spatial, contextual, and temporal information.<n>We also introduce a new scale dataset with 25 partially overlapping views, detailed scene reconstructions, and extensive occlusions.
arXiv Detail & Related papers (2025-07-11T11:17:25Z)
GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation [13.071227081328288]
Apparent personality analysis from short videos poses significant chal-lenges due to the complex interplay of visual, auditory, and textual cues.<n>In this paper, we propose GAME, a Graph-Augmented Multimodalvolution are designed to robustly model and fuse multi-source features for automatic personality prediction.<n>For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-Stream Network, which combines Graph Convolutional Networks (GCNs) and Convolutional Neural Net-works (CNNs)<n>To capture temporal dynamics, frame-level features are processed by a BiG
arXiv Detail & Related papers (2025-05-05T13:48:09Z)
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments [49.45034796115852]
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment. Current datasets fall short in scale, realism and do not capture the nature of OR scenes, limiting multimodal in OR modeling. We introduce MM-OR, a realistic and large-scale multimodal OR dataset, and first dataset to enable multimodal scene graph generation.
arXiv Detail & Related papers (2025-03-04T13:00:52Z)
Dual-level Mixup for Graph Few-shot Learning with Fewer Tasks [23.07584018576066]
We propose a SiMple yet effectIve approach for graph few-shot Learning with fEwer tasks, named SMILE. We introduce a dual-level mixup strategy, encompassing both within-task and across-task mixup, to simultaneously enrich the available nodes and tasks in meta-learning. Empirically, SMILE consistently outperforms other competitive models by a large margin across all evaluated datasets with in-domain and cross-domain settings.
arXiv Detail & Related papers (2025-02-19T23:59:05Z)
UniGraph2: Learning a Unified Embedding Space to Bind Multimodal Graphs [34.48393396390799]
We propose a novel cross-domain graph foundation model that enables general representation learning on multimodal graphs. UniGraph2 employs modality-specific encoders alongside a graph neural network (GNN) to learn a unified low-dimensional embedding space. We show that UniGraph2 significantly outperforms state-of-the-art models in tasks such as representation learning, transfer learning, and multimodal generative tasks.
arXiv Detail & Related papers (2025-02-02T14:04:53Z)
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens. MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
Multimodal Graph Benchmark [36.75510196380185]
Multimodal Graph Benchmark (MM-GRAPH) is first comprehensive multi-modal graph benchmark that incorporates both textual and visual information. MM-GRAPH consists of five graph learning datasets of various scales that are appropriate for different learning tasks. MM-GRAPH aims to foster research on multimodal graph learning and drive the development of more advanced and robust graph learning algorithms.
arXiv Detail & Related papers (2024-06-24T05:14:09Z)
MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion [51.80447197290866]
We introduce MyGO to process, fuse, and augment the fine-grained modality information from MMKGs. MyGO tokenizes multi-modal raw data as fine-grained discrete tokens and learns entity representations with a cross-modal entity encoder. Experiments on standard MMKGC benchmarks reveal that our method surpasses 20 of the latest models.
arXiv Detail & Related papers (2024-04-15T05:40:41Z)
NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild. NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities. We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter. We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another. Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
Unified and Dynamic Graph for Temporal Character Grouping in Long Videos [31.192044026127032]
Video temporal character grouping locates appearing moments of major characters within a video according to their identities. Recent works have evolved from unsupervised clustering to graph-based supervised clustering. We present a unified and dynamic graph (UniDG) framework for temporal character grouping.
arXiv Detail & Related papers (2023-08-27T13:22:55Z)
From random-walks to graph-sprints: a low-latency node embedding framework on continuous-time dynamic graphs [4.372841335228306]
We propose a framework for continuous-time-dynamic-graphs (CTDGs) that has low latency and is competitive with state-of-the-art, higher latency models. In our framework, time-aware node embeddings summarizing multi-hop information are computed using only single-hop operations on the incoming edges. We demonstrate that our graph-sprints features, combined with a machine learning, achieve competitive performance.
arXiv Detail & Related papers (2023-07-17T12:25:52Z)
Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation. We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion. To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion [28.077474663199062]
We propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNN) on modeling multimodal sequential data. Our graph-based model reaches state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2020-11-27T06:12:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.