Leveraging Foundation Models for Multimodal Graph-Based Action Recognition
- URL: http://arxiv.org/abs/2505.15192v1
- Date: Wed, 21 May 2025 07:15:14 GMT
- Title: Leveraging Foundation Models for Multimodal Graph-Based Action Recognition
- Authors: Fatemeh Ziaeetabar, Florentin Wörgötter, et al.
- Abstract summary: We introduce a graph-based framework that integrates a vision-language foundation, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding. We show that our method consistently outperforms state-of-the-art baselines on diverse benchmark datasets.
- Score: 1.533133219129073
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates a vision-language foundation, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the strength of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.
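The abstract's core mechanism, attention-weighted aggregation over a heterogeneous graph whose edges carry spatial, temporal, and semantic types, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the node features, edge list, and per-type weights below are invented for illustration, and the per-type scalar stands in for the paper's learned task-specific attention.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy multimodal graph: nodes are frame, object, and text embeddings (d = 4).
rng = np.random.default_rng(0)
d = 4
nodes = {
    "frame_0": rng.normal(size=d),
    "frame_1": rng.normal(size=d),
    "object_cup": rng.normal(size=d),   # hypothetical object node
    "text_grasp": rng.normal(size=d),   # hypothetical annotation node
}
names = list(nodes)
X = np.stack([nodes[n] for n in names])  # node feature matrix, shape (N, d)

# Edge-type importance: a scalar stand-in for the paper's task-specific
# attention that modulates edges by action semantics.
type_weight = {"temporal": 1.0, "spatial": 0.8, "semantic": 1.2}
edges = [  # (source, destination, edge type)
    ("frame_0", "frame_1", "temporal"),
    ("object_cup", "frame_1", "spatial"),
    ("text_grasp", "frame_1", "semantic"),
]

def attend(dst):
    """One graph-attention step: aggregate dst's neighbours using
    dot-product scores scaled by edge-type importance."""
    j = names.index(dst)
    nbrs = [(names.index(s), type_weight[t]) for s, d_, t in edges if d_ == dst]
    scores = np.array([w * (X[i] @ X[j]) for i, w in nbrs])
    alpha = softmax(scores)  # attention coefficients, sum to 1
    agg = sum(a * X[i] for a, (i, _) in zip(alpha, nbrs))
    return agg, alpha

updated, alpha = attend("frame_1")
print(alpha.round(3), updated.shape)
```

A full model would replace the dot-product scores with learned attention heads (as in a Graph Attention Network), stack such layers, and let the edge set itself evolve with the learned interactions, but the aggregation pattern is the same.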
Related papers
- Graph-Based Multimodal Contrastive Learning for Chart Question Answering [11.828192162922436]
This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures. The framework integrates both visual and textual graphs to capture structural and semantic characteristics. A graph contrastive learning strategy aligns node representations across modalities, enabling their seamless incorporation into a transformer decoder as soft prompts.
arXiv Detail & Related papers (2025-01-08T06:27:07Z) - DynaGRAG | Exploring the Topology of Information for Advancing Language Understanding and Generation in Graph Retrieval-Augmented Generation [0.0]
A novel GRAG framework, Dynamic Graph Retrieval-Augmented Generation (DynaGRAG), is proposed to enhance subgraph representation and diversity within the knowledge graph. Experimental results demonstrate the effectiveness of DynaGRAG, showcasing the significance of enhanced subgraph representation and diversity for improved language understanding and generation.
arXiv Detail & Related papers (2024-12-24T16:06:53Z) - Towards Graph Foundation Models: Training on Knowledge Graphs Enables Transferability to General Graphs [26.477872205199667]
We introduce SCR, a unified graph reasoning framework designed to train on knowledge graphs. We propose semantic-conditioned message passing, a novel mechanism addressing the inherent semantic isolation in traditional KG reasoning. Our results show substantial performance gains over existing foundation models.
arXiv Detail & Related papers (2024-10-16T14:26:08Z) - Retrieval Augmented Generation for Dynamic Graph Modeling [15.09162213134372]
We propose a novel framework, Retrieval-Augmented Generation for Dynamic Graph modeling (RAG4DyG). RAG4DyG enhances dynamic graph predictions by incorporating contextually and temporally relevant examples from broader graph structures. The proposed framework is designed to be effective in both transductive and inductive scenarios.
arXiv Detail & Related papers (2024-08-26T09:23:35Z) - Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z) - Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z) - Motif-based Graph Representation Learning with Application to Chemical Molecules [11.257235936629689]
Existing graph neural networks offer limited ability to capture complex interactions within local structural contexts.
We propose a new motif-based graph representation learning technique to better utilize local structural information.
MCM builds a motif vocabulary in an unsupervised way and deploys a novel motif convolution operation to extract the local structural context.
arXiv Detail & Related papers (2022-08-09T03:37:37Z) - TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
arXiv Detail & Related papers (2021-05-17T15:33:25Z) - GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated data and original data to reinforce its capability of integrating information on the graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatial-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Dynamic Emotion Modeling with Learnable Graphs and Graph Inception Network [0.0]
We present the Learnable Graph Inception Network (L-GrIN) that jointly learns to recognize emotion and to identify the underlying graph structure in the dynamic data.
We evaluate the proposed architecture on five benchmark emotion recognition databases spanning three different modalities.
arXiv Detail & Related papers (2020-08-06T13:51:31Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.